* nfs-backed mmap file results in 1000s of WRITEs per second
@ 2013-09-05 16:21 Quentin Barnes
2013-09-05 17:03 ` Malahal Naineni
0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 16:21 UTC (permalink / raw)
To: linux-nfs
If two (or more) processes are doing nothing more than writing to
the memory addresses of an mmapped shared file on an NFS mounted
file system, it results in the kernel scribbling WRITEs to the
server as fast as it can (1000s per second) even while no syscalls
are going on.
The problem happens on NFS clients mounting NFSv3 or NFSv4. I've
reproduced this on the 3.11 kernel, and it happens as far back as
RHEL6 (2.6.32 based); however, it is not a problem on RHEL5 (2.6.18
based). (All x86_64 systems.) I didn't try anything in between.
I've created a self-contained program below that will demonstrate
the problem (call it "t1"). Assuming /mnt has an NFS file system:
$ t1 /mnt/mynfsfile 1 # Fork 1 writer, kernel behaves normally
$ t1 /mnt/mynfsfile 2 # Fork 2 writers, kernel goes crazy WRITEing
Just run "watch -d nfsstat" in another window while running the two
writer test and watch the WRITE count explode.
I don't see anything particularly wrong with what the example code
is doing with its use of mmap. Is there anything undefined about
the code that would explain this behavior, or is this an NFS bug
that's really lived this long?
Quentin
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
int
kill_children()
{
int cnt = 0;
siginfo_t infop;
signal(SIGINT, SIG_IGN);
kill(0, SIGINT);
while (waitid(P_ALL, 0, &infop, WEXITED) != -1) ++cnt;
return cnt;
}
void
sighandler(int sig)
{
printf("Cleaning up all children.\n");
int cnt = kill_children();
printf("Cleaned up %d child%s.\n", cnt, cnt == 1 ? "" : "ren");
exit(0);
}
int
do_child(volatile int *iaddr)
{
while (1) *iaddr = 1;
}
int
main(int argc, char **argv)
{
const char *path;
int fd;
ssize_t wlen;
int *ip;
int fork_count = 1;
if (argc == 1) {
fprintf(stderr, "Usage: %s <path> [fork_count]\n",
argv[0]);
return 1;
}
path = argv[1];
if (argc > 2) {
int fc = atoi(argv[2]);
if (fc >= 0)
fork_count = fc;
}
fd = open(path, O_CREAT|O_TRUNC|O_RDWR|O_APPEND, S_IRUSR|S_IWUSR);
if (fd < 0) {
fprintf(stderr, "Open of '%s' failed: %s (%d)\n",
path, strerror(errno), errno);
return 1;
}
wlen = write(fd, &(int){0}, sizeof(int));
if (wlen != sizeof(int)) {
if (wlen < 0)
fprintf(stderr, "Write of '%s' failed: %s (%d)\n",
path, strerror(errno), errno);
else
fprintf(stderr, "Short write to '%s'\n", path);
return 1;
}
ip = (int *)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE,
MAP_SHARED, fd, 0);
if (ip == MAP_FAILED) {
fprintf(stderr, "Mmap of '%s' failed: %s (%d)\n",
path, strerror(errno), errno);
return 1;
}
signal(SIGINT, sighandler);
while (fork_count-- > 0) {
switch(fork()) {
case -1:
fprintf(stderr, "Fork failed: %s (%d)\n",
strerror(errno), errno);
kill_children();
return 1;
case 0: /* child */
signal(SIGINT, SIG_DFL);
do_child(ip);
break;
default: /* parent */
break;
}
}
printf("Press ^C to terminate test.\n");
pause();
return 0;
}
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 16:21 nfs-backed mmap file results in 1000s of WRITEs per second Quentin Barnes
@ 2013-09-05 17:03 ` Malahal Naineni
  2013-09-05 19:11   ` Quentin Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Malahal Naineni @ 2013-09-05 17:03 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs

Neil Brown posted a patch a couple of days ago for this!

http://thread.gmane.org/gmane.linux.nfs/58473

Regards, Malahal.

Quentin Barnes [qbarnes@gmail.com] wrote:
> If two (or more) processes are doing nothing more than writing to
> the memory addresses of an mmapped shared file on an NFS mounted
> file system, it results in the kernel scribbling WRITEs to the
> server as fast as it can (1000s per second) even while no syscalls
> are going on.
> [... rest of the original report and test program snipped ...]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 17:03 ` Malahal Naineni
@ 2013-09-05 19:11   ` Quentin Barnes
  2013-09-05 20:02     ` Myklebust, Trond
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 19:11 UTC (permalink / raw)
To: linux-nfs

On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> Neil Brown posted a patch a couple of days ago for this!
>
> http://thread.gmane.org/gmane.linux.nfs/58473

I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
exhibited the same 1000s of WRITEs/sec problem.

Any other ideas?

> [... quoted original report and test program snipped ...]

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 19:11 ` Quentin Barnes
@ 2013-09-05 20:02   ` Myklebust, Trond
  2013-09-05 21:36     ` Quentin Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-05 20:02 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5213 bytes --]

On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > Neil Brown posted a patch a couple of days ago for this!
> >
> > http://thread.gmane.org/gmane.linux.nfs/58473
>
> I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> exhibited the same 1000s of WRITEs/sec problem.
>
> Any other ideas?

Yes. Please try the attached patch.

> [... quoted original report and test program snipped ...]

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-NFS-Don-t-check-lock-owner-compatibility-in-writes-u.patch --]
[-- Type: text/x-patch; name="0001-NFS-Don-t-check-lock-owner-compatibility-in-writes-u.patch", Size: 1257 bytes --]

From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Thu, 5 Sep 2013 15:52:51 -0400
Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless
 file is locked

If we're doing buffered writes, and there is no file locking involved,
then we don't have to worry about whether or not the lock owner
information is identical.
By relaxing this check, we ensure that fork()ed child processes can
write to a page without having to first sync dirty data that was
written by the parent to disk.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 40979e8..ac1dc33 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -863,7 +863,7 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
 		return 0;
 	l_ctx = req->wb_lock_context;
 	do_flush = req->wb_page != page || req->wb_context != ctx;
-	if (l_ctx) {
+	if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
 		do_flush |= l_ctx->lockowner.l_owner != current->files
 			|| l_ctx->lockowner.l_pid != current->tgid;
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 20:02 ` Myklebust, Trond
@ 2013-09-05 21:36   ` Quentin Barnes
  2013-09-05 21:57     ` Myklebust, Trond
  2013-09-05 22:07     ` Myklebust, Trond
  0 siblings, 2 replies; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 21:36 UTC (permalink / raw)
To: Myklebust, Trond; +Cc: linux-nfs@vger.kernel.org

On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > exhibited the same 1000s of WRITEs/sec problem.
> >
> > Any other ideas?
>
> Yes. Please try the attached patch.

Great!  That did the trick!

Do you feel this patch could be worthy of pushing it upstream in its
current state or was it just to verify a theory?

In comparing the nfs_flush_incompatible() implementations between
RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
more or less logically equivalent to me on whether or not to flush
the page.  Also, when and where nfs_flush_incompatible() is invoked
seems the same.  Would you provide a very brief pointer to clue me
in as to why this problem didn't also manifest circa 2.6.18 days?

> [... quoted earlier messages and patch snipped ...]

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 21:36 ` Quentin Barnes
@ 2013-09-05 21:57   ` Myklebust, Trond
  2013-09-05 22:34     ` Quentin Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-05 21:57 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs@vger.kernel.org

On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> In comparing the nfs_flush_incompatible() implementations between
> RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> more or less logically equivalent to me on whether or not to flush
> the page.  Also, when and where nfs_flush_incompatible() is invoked
> seems the same.  Would you provide a very brief pointer to clue me
> in as to why this problem didn't also manifest circa 2.6.18 days?

There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
days, and so the risk was that your mmapped writes could end up being
sent with the wrong credentials.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 21:57 ` Myklebust, Trond
@ 2013-09-05 22:34   ` Quentin Barnes
  2013-09-06 13:36     ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 22:34 UTC (permalink / raw)
To: Myklebust, Trond; +Cc: linux-nfs@vger.kernel.org

On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > Would you provide a very brief pointer to clue me
> > in as to why this problem didn't also manifest circa 2.6.18 days?
>
> There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> days, and so the risk was that your mmapped writes could end up being
> sent with the wrong credentials.

Ah!  You're right that nfs_vm_page_mkwrite() was missing from
the original 2.6.18, so that makes sense; however, Red Hat had
backported that function starting with their RHEL5.9(*) kernels,
yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
somewhere in RHEL5.9's do_wp_page(), or up that call path, but
glancing through it, it all looks pretty close though.

(*) That was the source I was using when comparing with the 3.11
source when studying your patch since it was the last kernel known
to me without the problem.

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 22:34 ` Quentin Barnes
@ 2013-09-06 13:36   ` Jeff Layton
  2013-09-06 15:00     ` Myklebust, Trond
  2013-09-06 16:48     ` Quentin Barnes
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2013-09-06 13:36 UTC (permalink / raw)
To: Quentin Barnes; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org

On Thu, 5 Sep 2013 17:34:20 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> the original 2.6.18, so that makes sense; however, Red Hat had
> backported that function starting with their RHEL5.9(*) kernels,
> yet the problem doesn't manifest on RHEL5.9.
> [... rest of quoted message snipped ...]

I'm pretty sure RHEL5 has a similar problem, but it's unclear to me
why you're not seeing it there.  I have a RHBZ open vs. RHEL5 but
it's marked private at the moment (I'll see about opening it up).

I brought this up upstream about a year ago with this strawman patch:

    http://article.gmane.org/gmane.linux.nfs/51240

...at the time Trond said he was working on a set of patches to track
the open/lock stateid on a per-req basis.  Did that approach not pan
out?

Also, do you need to do a similar fix to nfs_can_coalesce_requests?

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 13:36 ` Jeff Layton @ 2013-09-06 15:00 ` Myklebust, Trond 2013-09-06 15:04 ` Jeff Layton 2013-09-06 16:48 ` Quentin Barnes 1 sibling, 1 reply; 20+ messages in thread From: Myklebust, Trond @ 2013-09-06 15:00 UTC (permalink / raw) To: Jeff Layton; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote: > On Thu, 5 Sep 2013 17:34:20 -0500 > Quentin Barnes <qbarnes@gmail.com> wrote: > > > On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > > > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > > > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > > > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > > > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > > > > > > > Neil Brown posted a patch couple days ago for this! > > > > > > > > > > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > > > > > > > > > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > > > > > > exhibited the same 1000s of WRITEs/sec problem. > > > > > > > > > > > > Any other ideas? > > > > > > > > > > Yes. Please try the attached patch. > > > > > > > > Great! That did the trick! > > > > > > > > Do you feel this patch could be worthy of pushing it upstream in its > > > > current state or was it just to verify a theory? > > > > > > > > In comparing the nfs_flush_incompatible() implementations between > > > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > > > > more or less logically equivalent to me on whether or not to flush > > > > the page. Also, when and where nfs_flush_incompatible() is invoked > > > > seems the same. Would you provide a very brief pointer to clue me > > > > in as to why this problem didn't also manifest circa 2.6.18 days? > > > > > > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > > > days, and so the risk was that your mmapped writes could end up being > > > sent with the wrong credentials. > > > > Ah! You're right that nfs_vm_page_mkwrite() was missing from > > the original 2.6.18, so that makes sense, however, Red Hat had > > backported that function starting with their RHEL5.9(*) kernels, > > yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > > somewhere in RHEL5.9's do_wp_page(), or up that call path, but > > glancing through it, it all looks pretty close though. > > > > (*) That was the source I using when comparing with the 3.11 source > > when studying your patch since it was the last kernel known to me > > without the problem. > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > private at the moment (I'll see about opening it up). I brought this up > upstream about a year ago with this strawman patch: > > http://article.gmane.org/gmane.linux.nfs/51240 > > ...at the time Trond said he was working on a set of patches to track > the open/lock stateid on a per-req basis. Did that approach not pan > out? We've achieved what we wanted to do (Neil's lock recovery patch) without that machinery, so for now, we're dropping that. > Also, do you need to do a similar fix to nfs_can_coalesce_requests? Yes. Good point! -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 15:00 ` Myklebust, Trond @ 2013-09-06 15:04 ` Jeff Layton 2013-09-06 15:39 ` Myklebust, Trond 0 siblings, 1 reply; 20+ messages in thread From: Jeff Layton @ 2013-09-06 15:04 UTC (permalink / raw) To: Myklebust, Trond; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org On Fri, 6 Sep 2013 15:00:56 +0000 "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote: > On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote: > > On Thu, 5 Sep 2013 17:34:20 -0500 > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > > > On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > > > > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > > > > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > > > > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > > > > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > > > > > > > > Neil Brown posted a patch couple days ago for this! > > > > > > > > > > > > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > > > > > > > > > > > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > > > > > > > exhibited the same 1000s of WRITEs/sec problem. > > > > > > > > > > > > > > Any other ideas? > > > > > > > > > > > > Yes. Please try the attached patch. > > > > > > > > > > Great! That did the trick! > > > > > > > > > > Do you feel this patch could be worthy of pushing it upstream in its > > > > > current state or was it just to verify a theory? > > > > > > > > > > > > > > > In comparing the nfs_flush_incompatible() implementations between > > > > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > > > > > more or less logically equivalent to me on whether or not to flush > > > > > the page. Also, when and where nfs_flush_incompatible() is invoked > > > > > seems the same. 
Would you provide a very brief pointer to clue me > > > > > in as to why this problem didn't also manifest circa 2.6.18 days? > > > > > > > > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > > > > days, and so the risk was that your mmapped writes could end up being > > > > sent with the wrong credentials. > > > > > > Ah! You're right that nfs_vm_page_mkwrite() was missing from > > > the original 2.6.18, so that makes sense, however, Red Hat had > > > backported that function starting with their RHEL5.9(*) kernels, > > > yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > > > somewhere in RHEL5.9's do_wp_page(), or up that call path, but > > > glancing through it, it all looks pretty close though. > > > > > > > > > (*) That was the source I using when comparing with the 3.11 source > > > when studying your patch since it was the last kernel known to me > > > without the problem. > > > > > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > > private at the moment (I'll see about opening it up). I brought this up > > upstream about a year ago with this strawman patch: > > > > http://article.gmane.org/gmane.linux.nfs/51240 > > > > ...at the time Trond said he was working on a set of patches to track > > the open/lock stateid on a per-req basis. Did that approach not pan > > out? > > We've achieved what we wanted to do (Neil's lock recovery patch) without > that machinery, so for now, we're dropping that. > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Yes. Good point! > Cool. FWIW, here's the original bug that was opened against RHEL5: https://bugzilla.redhat.com/show_bug.cgi?id=736578 ...the reproducer that Max cooked up is not doing mmapped I/O so there may be a difference there, but I haven't looked closely at why that is. 
-- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 15:04 ` Jeff Layton @ 2013-09-06 15:39 ` Myklebust, Trond 2013-09-08 14:25 ` William Dauchy 0 siblings, 1 reply; 20+ messages in thread From: Myklebust, Trond @ 2013-09-06 15:39 UTC (permalink / raw) To: Jeff Layton; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 972 bytes --] On Fri, 2013-09-06 at 11:04 -0400, Jeff Layton wrote: > On Fri, 6 Sep 2013 15:00:56 +0000 > "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote: > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > > > Yes. Good point! > > > > Cool. FWIW, here's the original bug that was opened against RHEL5: > > https://bugzilla.redhat.com/show_bug.cgi?id=736578 > > ...the reproducer that Max cooked up is not doing mmapped I/O so there > may be a difference there, but I haven't looked closely at why that is. Here is the patch... There should be no big differences between mmapped I/O and ordinary I/O now that we have page_mkwrite() to set up the request. The only difference that I can think of offhand is when you use locking: for mmap() writes, NFS can only guarantee data integrity if you lock the entire page. 
-- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-NFS-Don-t-check-lock-owner-compatability-unless-file.patch --] [-- Type: text/x-patch; name="0001-NFS-Don-t-check-lock-owner-compatability-unless-file.patch", Size: 2262 bytes --] From 4109bb7496640aa97a12904527ba8e3a19b7ce7a Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Fri, 6 Sep 2013 11:09:38 -0400 Subject: [PATCH] NFS: Don't check lock owner compatability unless file is locked (part 2) When coalescing requests into a single READ or WRITE RPC call, and there is no file locking involved, we don't have to refuse coalescing for requests where the lock owner information doesn't match. Reported-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/pagelist.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c index 29cfb7a..2ffebf2 100644 --- a/fs/nfs/pagelist.c +++ b/fs/nfs/pagelist.c @@ -328,6 +328,19 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc, } EXPORT_SYMBOL_GPL(nfs_pageio_init); +static bool nfs_match_open_context(const struct nfs_open_context *ctx1, + const struct nfs_open_context *ctx2) +{ + return ctx1->cred == ctx2->cred && ctx1->state == ctx2->state; +} + +static bool nfs_match_lock_context(const struct nfs_lock_context *l1, + const struct nfs_lock_context *l2) +{ + return l1->lockowner.l_owner == l2->lockowner.l_owner + && l1->lockowner.l_pid == l2->lockowner.l_pid; +} + /** * nfs_can_coalesce_requests - test two requests for compatibility * @prev: pointer to nfs_page @@ -343,13 +356,10 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev, struct nfs_page *req, struct nfs_pageio_descriptor *pgio) { - if (req->wb_context->cred != prev->wb_context->cred) - return false; - if 
(req->wb_lock_context->lockowner.l_owner != prev->wb_lock_context->lockowner.l_owner) - return false; - if (req->wb_lock_context->lockowner.l_pid != prev->wb_lock_context->lockowner.l_pid) + if (!nfs_match_open_context(req->wb_context, prev->wb_context)) return false; - if (req->wb_context->state != prev->wb_context->state) + if (req->wb_context->dentry->d_inode->i_flock != NULL && + !nfs_match_lock_context(req->wb_lock_context, prev->wb_lock_context)) return false; if (req->wb_pgbase != 0) return false; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 15:39 ` Myklebust, Trond @ 2013-09-08 14:25 ` William Dauchy 0 siblings, 0 replies; 20+ messages in thread From: William Dauchy @ 2013-09-08 14:25 UTC (permalink / raw) To: Myklebust, Trond; +Cc: Jeff Layton, Quentin Barnes, linux-nfs@vger.kernel.org On Fri, Sep 6, 2013 at 5:39 PM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > There should be no big differences between mmapped I/O and ordinary I/O > now that we have page_mkwrite() to set up the request. The only > difference that I can think of offhand is when you use locking: for > mmap() writes, NFS can only guarantee data integrity if you lock the > entire page. Tested the two patches on top of a 3.10.x. Looks ok so far. -- William ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 13:36 ` Jeff Layton 2013-09-06 15:00 ` Myklebust, Trond @ 2013-09-06 16:48 ` Quentin Barnes 2013-09-07 14:51 ` Jeff Layton 2013-09-09 13:04 ` Jeff Layton 1 sibling, 2 replies; 20+ messages in thread From: Quentin Barnes @ 2013-09-06 16:48 UTC (permalink / raw) To: Jeff Layton; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org Jeff, can you try out my test program in the base note on your RHEL5.9 or later RHEL5.x kernels? I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 kernel (latest released RHEL5.9) does not show the problem for me. Based on what you and Trond have said in this thread though, I'm really curious why it doesn't have the problem. On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > On Thu, 5 Sep 2013 17:34:20 -0500 > Quentin Barnes <qbarnes@gmail.com> wrote: > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: >> > > > > > Neil Brown posted a patch couple days ago for this! >> > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 >> > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still >> > > > > exhibited the same 1000s of WRITEs/sec problem. >> > > > > >> > > > > Any other ideas? >> > > > >> > > > Yes. Please try the attached patch. >> > > >> > > Great! That did the trick! >> > > >> > > Do you feel this patch could be worthy of pushing it upstream in its >> > > current state or was it just to verify a theory? 
>> > > >> > > >> > > In comparing the nfs_flush_incompatible() implementations between >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem >> > > more or less logically equivalent to me on whether or not to flush >> > > the page. Also, when and where nfs_flush_incompatible() is invoked >> > > seems the same. Would you provide a very brief pointer to clue me >> > > in as to why this problem didn't also manifest circa 2.6.18 days? >> > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 >> > days, and so the risk was that your mmapped writes could end up being >> > sent with the wrong credentials. >> >> Ah! You're right that nfs_vm_page_mkwrite() was missing from >> the original 2.6.18, so that makes sense, however, Red Hat had >> backported that function starting with their RHEL5.9(*) kernels, >> yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but >> glancing through it, it all looks pretty close though. >> >> >> (*) That was the source I using when comparing with the 3.11 source >> when studying your patch since it was the last kernel known to me >> without the problem. >> > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > private at the moment (I'll see about opening it up). I brought this up > upstream about a year ago with this strawman patch: > > http://article.gmane.org/gmane.linux.nfs/51240 > > ...at the time Trond said he was working on a set of patches to track > the open/lock stateid on a per-req basis. Did that approach not pan > out? > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Thanks, > -- > Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 16:48 ` Quentin Barnes @ 2013-09-07 14:51 ` Jeff Layton 2013-09-07 15:00 ` Myklebust, Trond 2013-09-09 13:04 ` Jeff Layton 1 sibling, 1 reply; 20+ messages in thread From: Jeff Layton @ 2013-09-07 14:51 UTC (permalink / raw) To: Quentin Barnes; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org On Fri, 6 Sep 2013 11:48:45 -0500 Quentin Barnes <qbarnes@gmail.com> wrote: > Jeff, can your try out my test program in the base note on your > RHEL5.9 or later RHEL5.x kernels? > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > kernel (latest released RHEL5.9) does not show the problem for me. > Based on what you and Trond have said in this thread though, I'm > really curious why it doesn't have the problem. > > On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > > On Thu, 5 Sep 2013 17:34:20 -0500 > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > >> > > > > > Neil Brown posted a patch couple days ago for this! > >> > > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > >> > > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > >> > > > > exhibited the same 1000s of WRITEs/sec problem. > >> > > > > > >> > > > > Any other ideas? > >> > > > > >> > > > Yes. Please try the attached patch. > >> > > > >> > > Great! That did the trick! > >> > > > >> > > Do you feel this patch could be worthy of pushing it upstream in its > >> > > current state or was it just to verify a theory? 
> >> > > > >> > > > >> > > In comparing the nfs_flush_incompatible() implementations between > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > >> > > more or less logically equivalent to me on whether or not to flush > >> > > the page. Also, when and where nfs_flush_incompatible() is invoked > >> > > seems the same. Would you provide a very brief pointer to clue me > >> > > in as to why this problem didn't also manifest circa 2.6.18 days? > >> > > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > >> > days, and so the risk was that your mmapped writes could end up being > >> > sent with the wrong credentials. > >> > >> Ah! You're right that nfs_vm_page_mkwrite() was missing from > >> the original 2.6.18, so that makes sense, however, Red Hat had > >> backported that function starting with their RHEL5.9(*) kernels, > >> yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but > >> glancing through it, it all looks pretty close though. > >> > >> > >> (*) That was the source I using when comparing with the 3.11 source > >> when studying your patch since it was the last kernel known to me > >> without the problem. > >> > > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > > private at the moment (I'll see about opening it up). I brought this up > > upstream about a year ago with this strawman patch: > > > > http://article.gmane.org/gmane.linux.nfs/51240 > > > > ...at the time Trond said he was working on a set of patches to track > > the open/lock stateid on a per-req basis. Did that approach not pan > > out? > > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Yes, I see the same behavior you do. With a recent kernel I see a ton of WRITE requests go out, with RHEL5 hardly any. 
I guess I'm a little confused as to the reverse question. Why are we seeing this data get flushed out so quickly in recent kernels from just changes to the mmaped pages? My understanding has always been that when a page is cleaned, we set the WP bit on it, and then when it goes dirty we clear it and also call page_mkwrite (not necessarily in that order). So here we have two processes that mmap the same page, and then are furiously writing to it. The kernel shouldn't really care or be aware of that thrashing until that page gets flushed out for some reason (msync() call or VM pressure). IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more recent kernels don't behave that way. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-07 14:51 ` Jeff Layton @ 2013-09-07 15:00 ` Myklebust, Trond 0 siblings, 0 replies; 20+ messages in thread From: Myklebust, Trond @ 2013-09-07 15:00 UTC (permalink / raw) To: Jeff Layton; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org On Sat, 2013-09-07 at 10:51 -0400, Jeff Layton wrote: > On Fri, 6 Sep 2013 11:48:45 -0500 > Quentin Barnes <qbarnes@gmail.com> wrote: > > > Jeff, can your try out my test program in the base note on your > > RHEL5.9 or later RHEL5.x kernels? > > > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > > kernel (latest released RHEL5.9) does not show the problem for me. > > Based on what you and Trond have said in this thread though, I'm > > really curious why it doesn't have the problem. > > > > On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > > > On Thu, 5 Sep 2013 17:34:20 -0500 > > > Quentin Barnes <qbarnes@gmail.com> wrote: > > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > > >> > > > > > Neil Brown posted a patch couple days ago for this! > > >> > > > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > > >> > > > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > > >> > > > > exhibited the same 1000s of WRITEs/sec problem. > > >> > > > > > > >> > > > > Any other ideas? > > >> > > > > > >> > > > Yes. Please try the attached patch. > > >> > > > > >> > > Great! That did the trick! > > >> > > > > >> > > Do you feel this patch could be worthy of pushing it upstream in its > > >> > > current state or was it just to verify a theory? > > >> > > > > >> > > In comparing the nfs_flush_incompatible() implementations between > > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > > >> > > more or less logically equivalent to me on whether or not to flush > > >> > > the page. Also, when and where nfs_flush_incompatible() is invoked > > >> > > seems the same. Would you provide a very brief pointer to clue me > > >> > > in as to why this problem didn't also manifest circa 2.6.18 days? > > >> > > > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > > >> > days, and so the risk was that your mmapped writes could end up being > > >> > sent with the wrong credentials. > > >> > > >> Ah! You're right that nfs_vm_page_mkwrite() was missing from > > >> the original 2.6.18, so that makes sense, however, Red Hat had > > >> backported that function starting with their RHEL5.9(*) kernels, > > >> yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > > >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but > > >> glancing through it, it all looks pretty close though. > > >> > > >> (*) That was the source I using when comparing with the 3.11 source > > >> when studying your patch since it was the last kernel known to me > > >> without the problem. > > > > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > > > private at the moment (I'll see about opening it up). I brought this up > > > upstream about a year ago with this strawman patch: > > > > > > http://article.gmane.org/gmane.linux.nfs/51240 > > > > > > ...at the time Trond said he was working on a set of patches to track > > > the open/lock stateid on a per-req basis. Did that approach not pan > > > out? > > > > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Yes, I see the same behavior you do. With a recent kernel I see a ton > of WRITE requests go out, with RHEL5 hardly any. > > I guess I'm a little confused as to the reverse question. Why are we > seeing this data get flushed out so quickly in recent kernels from just > changes to the mmaped pages? > > My understanding has always been that when a page is cleaned, we set > the WP bit on it, and then when it goes dirty we clear it and also > call page_mkwrite (not necessarily in that order). > > So here we have two processes that mmap the same page, and then are > furiously writing to it. The kernel shouldn't really care or be aware > of that thrashing until that page gets flushed out for some reason > (msync() call or VM pressure). fork() is not supposed to share page tables between parent and child process. Shouldn't that also imply that the page write protect bits are not shared? IOW: a write protect page fault in the parent process that sets page_mkwrite() should not prevent a similar write protect page fault in the child process (and subsequent call to page_mkwrite()). ...or is my understanding of the page fault semantics wrong? > IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more > recent kernels don't behave that way. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 16:48 ` Quentin Barnes 2013-09-07 14:51 ` Jeff Layton @ 2013-09-09 13:04 ` Jeff Layton 2013-09-09 17:32 ` Quentin Barnes 1 sibling, 1 reply; 20+ messages in thread From: Jeff Layton @ 2013-09-09 13:04 UTC (permalink / raw) To: Quentin Barnes; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org On Fri, 6 Sep 2013 11:48:45 -0500 Quentin Barnes <qbarnes@gmail.com> wrote: > Jeff, can your try out my test program in the base note on your > RHEL5.9 or later RHEL5.x kernels? > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > kernel (latest released RHEL5.9) does not show the problem for me. > Based on what you and Trond have said in this thread though, I'm > really curious why it doesn't have the problem. > I can confirm what you see on RHEL5. One difference is that RHEL5's page_mkwrite handler does not do wait_on_page_writeback. That was added as part of the stable pages work that went in a while back, so that may be the main difference. Adding that in doesn't seem to materially change things though. In any case, what I see is that the initial program just ends up with two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then things settle down (likely because the page is still marked dirty). Eventually, another write occurs and the dirty page gets pushed out to the server in a small flurry of WRITEs to the same range. Then, things settle down again until there's another small flurry of activity. My suspicion is that there is a race condition involved here, but I'm unclear on where it is. I'm not 100% convinced this is a bug, but page fault semantics aren't my strong suit. You may want to consider opening a "formal" RH support case if you have interest in getting Trond's patch backported, and/or following up on why RHEL5 behaves the way it does. 
> On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > > On Thu, 5 Sep 2013 17:34:20 -0500 > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > >> > > > > > Neil Brown posted a patch couple days ago for this! > >> > > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > >> > > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > >> > > > > exhibited the same 1000s of WRITEs/sec problem. > >> > > > > > >> > > > > Any other ideas? > >> > > > > >> > > > Yes. Please try the attached patch. > >> > > > >> > > Great! That did the trick! > >> > > > >> > > Do you feel this patch could be worthy of pushing it upstream in its > >> > > current state or was it just to verify a theory? > >> > > > >> > > > >> > > In comparing the nfs_flush_incompatible() implementations between > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > >> > > more or less logically equivalent to me on whether or not to flush > >> > > the page. Also, when and where nfs_flush_incompatible() is invoked > >> > > seems the same. Would you provide a very brief pointer to clue me > >> > > in as to why this problem didn't also manifest circa 2.6.18 days? > >> > > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > >> > days, and so the risk was that your mmapped writes could end up being > >> > sent with the wrong credentials. > >> > >> Ah! 
> >> You're right that nfs_vm_page_mkwrite() was missing from
> >> the original 2.6.18, so that makes sense, however, Red Hat had
> >> backported that function starting with their RHEL5.9(*) kernels,
> >> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> >> glancing through it, it all looks pretty close though.
> >>
> >> (*) That was the source I was using when comparing with the 3.11 source
> >> when studying your patch since it was the last kernel known to me
> >> without the problem.
> >>
> >
> > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > private at the moment (I'll see about opening it up). I brought this up
> > upstream about a year ago with this strawman patch:
> >
> > http://article.gmane.org/gmane.linux.nfs/51240
> >
> > ...at the time Trond said he was working on a set of patches to track
> > the open/lock stateid on a per-req basis. Did that approach not pan
> > out?
> >
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> >
> > Thanks,
> > --
> > Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-09 13:04 ` Jeff Layton
@ 2013-09-09 17:32 ` Quentin Barnes
  2013-09-09 17:47 ` Myklebust, Trond
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-09 17:32 UTC (permalink / raw)
To: Jeff Layton; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org

On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> On Fri, 6 Sep 2013 11:48:45 -0500
> Quentin Barnes <qbarnes@gmail.com> wrote:
>
> > Jeff, can you try out my test program in the base note on your
> > RHEL5.9 or later RHEL5.x kernels?
> >
> > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > kernel (latest released RHEL5.9) does not show the problem for me.
> > Based on what you and Trond have said in this thread though, I'm
> > really curious why it doesn't have the problem.
>
> I can confirm what you see on RHEL5. One difference is that RHEL5's
> page_mkwrite handler does not do wait_on_page_writeback. That was added
> as part of the stable pages work that went in a while back, so that may
> be the main difference. Adding that in doesn't seem to materially
> change things though.

Good to know you confirmed the behavior I saw on RHEL5 (just so that
I know it's not some random variable in play I had overlooked).

> In any case, what I see is that the initial program just ends up with
> two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> things settle down (likely because the page is still marked dirty).
>
> Eventually, another write occurs and the dirty page gets pushed out to
> the server in a small flurry of WRITEs to the same range. Then, things
> settle down again until there's another small flurry of activity.
>
> My suspicion is that there is a race condition involved here, but I'm
> unclear on where it is. I'm not 100% convinced this is a bug, but page
> fault semantics aren't my strong suit.
As a test on RHEL6, I made a trivial systemtap script for kprobing
nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
make sure this bug was limited to just the nfs module and was not a
result of some mm behavior change.

With the bug unfixed running the test program, nfs_vm_page_mkwrite()
and nfs_flush_incompatible() are called repeatedly at a very high rate
(hence all the WRITEs).

After Trond's patch, the two functions are called just at the
program's initialization and then called only every 30 seconds or
so.

It looks to me from the code flow that there must be something
nfs_wb_page() does that resets the need for mm to keep reinvoking
nfs_vm_page_mkwrite().  I didn't look any deeper than that though
for now.  Maybe a race in how nfs_wb_page() updates status you're
thinking of?

> You may want to consider opening a "formal" RH support case if you have
> interest in getting Trond's patch backported, and/or following up on
> why RHEL5 behaves the way it does.

Yes, I'll be doing that.  When I do, I'll send you an email with the
case ticket.  Before filing it though, I want to hear back from the
group that had the original problem to make sure Trond's patch fully
addresses their problem (besides just the trivial test program).

> --
> Jeff Layton <jlayton@redhat.com>

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
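[Editor's note: the systemtap script itself was not posted in the thread. A minimal sketch along the lines Quentin describes might look like the following; the probe targets come from his message, while the module name ("nfs") and the one-second reporting interval are assumptions.]

```systemtap
# Sketch: count per-second calls to the two probed functions.
# Run as root while the mmap test program is active:  stap nfs_mkwrite.stp
global mkwrite, flush

probe module("nfs").function("nfs_vm_page_mkwrite")   { mkwrite++ }
probe module("nfs").function("nfs_flush_incompatible") { flush++ }

probe timer.s(1) {
    printf("%6d nfs_vm_page_mkwrite/s  %6d nfs_flush_incompatible/s\n",
           mkwrite, flush)
    mkwrite = 0
    flush = 0
}
```

On an unpatched kernel this would be expected to show both counters spiking into the thousands while the two-writer test runs; with Trond's patch applied, only the occasional call as writeback kicks in.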
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-09 17:32 ` Quentin Barnes
@ 2013-09-09 17:47 ` Myklebust, Trond
  2013-09-09 18:21 ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-09 17:47 UTC (permalink / raw)
To: Quentin Barnes; +Cc: Jeff Layton, linux-nfs@vger.kernel.org

On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > On Fri, 6 Sep 2013 11:48:45 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> >
> > > Jeff, can you try out my test program in the base note on your
> > > RHEL5.9 or later RHEL5.x kernels?
> > >
> > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > Based on what you and Trond have said in this thread though, I'm
> > > really curious why it doesn't have the problem.
> >
> > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > as part of the stable pages work that went in a while back, so that may
> > be the main difference. Adding that in doesn't seem to materially
> > change things though.
>
> Good to know you confirmed the behavior I saw on RHEL5 (just so that
> I know it's not some random variable in play I had overlooked).
>
> > In any case, what I see is that the initial program just ends up with
> > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > things settle down (likely because the page is still marked dirty).
> >
> > Eventually, another write occurs and the dirty page gets pushed out to
> > the server in a small flurry of WRITEs to the same range. Then, things
> > settle down again until there's another small flurry of activity.
> >
> > My suspicion is that there is a race condition involved here, but I'm
> > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > fault semantics aren't my strong suit.
>
> As a test on RHEL6, I made a trivial systemtap script for kprobing
> nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> make sure this bug was limited to just the nfs module and was not a
> result of some mm behavior change.
>
> With the bug unfixed running the test program, nfs_vm_page_mkwrite()
> and nfs_flush_incompatible() are called repeatedly at a very high rate
> (hence all the WRITEs).
>
> After Trond's patch, the two functions are called just at the
> program's initialization and then called only every 30 seconds or
> so.
>
> It looks to me from the code flow that there must be something
> nfs_wb_page() does that resets the need for mm to keep reinvoking
> nfs_vm_page_mkwrite().  I didn't look any deeper than that though
> for now.  Maybe a race in how nfs_wb_page() updates status you're
> thinking of?

In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(),
which does _not_ call clear_page_dirty_for_io() (and hence does not call
page_mkclean()).

That would explain it...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-09 17:47 ` Myklebust, Trond
@ 2013-09-09 18:21 ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-09-09 18:21 UTC (permalink / raw)
To: Myklebust, Trond; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org

On Mon, 9 Sep 2013 17:47:48 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> > On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > > On Fri, 6 Sep 2013 11:48:45 -0500
> > > Quentin Barnes <qbarnes@gmail.com> wrote:
> > >
> > > > Jeff, can you try out my test program in the base note on your
> > > > RHEL5.9 or later RHEL5.x kernels?
> > > >
> > > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > > Based on what you and Trond have said in this thread though, I'm
> > > > really curious why it doesn't have the problem.
> > >
> > > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > > as part of the stable pages work that went in a while back, so that may
> > > be the main difference. Adding that in doesn't seem to materially
> > > change things though.
> >
> > Good to know you confirmed the behavior I saw on RHEL5 (just so that
> > I know it's not some random variable in play I had overlooked).
> >
> > > In any case, what I see is that the initial program just ends up with
> > > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > > things settle down (likely because the page is still marked dirty).
> > >
> > > Eventually, another write occurs and the dirty page gets pushed out to
> > > the server in a small flurry of WRITEs to the same range. Then, things
> > > settle down again until there's another small flurry of activity.
> > >
> > > My suspicion is that there is a race condition involved here, but I'm
> > > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > > fault semantics aren't my strong suit.
> >
> > As a test on RHEL6, I made a trivial systemtap script for kprobing
> > nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> > make sure this bug was limited to just the nfs module and was not a
> > result of some mm behavior change.
> >
> > With the bug unfixed running the test program, nfs_vm_page_mkwrite()
> > and nfs_flush_incompatible() are called repeatedly at a very high rate
> > (hence all the WRITEs).
> >
> > After Trond's patch, the two functions are called just at the
> > program's initialization and then called only every 30 seconds or
> > so.
> >
> > It looks to me from the code flow that there must be something
> > nfs_wb_page() does that resets the need for mm to keep reinvoking
> > nfs_vm_page_mkwrite().  I didn't look any deeper than that though
> > for now.  Maybe a race in how nfs_wb_page() updates status you're
> > thinking of?
>
> In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(),
> which does _not_ call clear_page_dirty_for_io() (and hence does not call
> page_mkclean()).
>
> That would explain it...

Thanks Trond, that does explain it. FWIW, at this point in the RHEL5
lifecycle I'd be disinclined to make any changes to that code without
some strong justification. Backporting Trond's recent patch for RHEL6
and making sure that RHEL7 has it sounds quite reasonable though.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 21:36 ` Quentin Barnes
  2013-09-05 21:57 ` Myklebust, Trond
@ 2013-09-05 22:07 ` Myklebust, Trond
  1 sibling, 0 replies; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-05 22:07 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs@vger.kernel.org

On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > Neil Brown posted a patch couple days ago for this!
> > > >
> > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > >
> > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > exhibited the same 1000s of WRITEs/sec problem.
> > >
> > > Any other ideas?
> >
> > Yes. Please try the attached patch.
>
> Great!  That did the trick!
>
> Do you feel this patch could be worthy of pushing it upstream in its
> current state or was it just to verify a theory?

It should be safe to merge this. It is a valid optimisation in the case
where there are no locks applied.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2013-09-09 18:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-05 16:21 nfs-backed mmap file results in 1000s of WRITEs per second Quentin Barnes
2013-09-05 17:03 ` Malahal Naineni
2013-09-05 19:11 ` Quentin Barnes
2013-09-05 20:02 ` Myklebust, Trond
2013-09-05 21:36 ` Quentin Barnes
2013-09-05 21:57 ` Myklebust, Trond
2013-09-05 22:34 ` Quentin Barnes
2013-09-06 13:36 ` Jeff Layton
2013-09-06 15:00 ` Myklebust, Trond
2013-09-06 15:04 ` Jeff Layton
2013-09-06 15:39 ` Myklebust, Trond
2013-09-08 14:25 ` William Dauchy
2013-09-06 16:48 ` Quentin Barnes
2013-09-07 14:51 ` Jeff Layton
2013-09-07 15:00 ` Myklebust, Trond
2013-09-09 13:04 ` Jeff Layton
2013-09-09 17:32 ` Quentin Barnes
2013-09-09 17:47 ` Myklebust, Trond
2013-09-09 18:21 ` Jeff Layton
2013-09-05 22:07 ` Myklebust, Trond