* nfs-backed mmap file results in 1000s of WRITEs per second
@ 2013-09-05 16:21 Quentin Barnes
2013-09-05 17:03 ` Malahal Naineni
0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 16:21 UTC (permalink / raw)
To: linux-nfs
If two (or more) processes are doing nothing more than writing to
the memory addresses of an mmapped shared file on an NFS mounted
file system, it results in the kernel scribbling WRITEs to the
server as fast as it can (1000s per second) even while no syscalls
are going on.
The problem happens on NFS clients mounting NFSv3 or NFSv4. I've
reproduced this on the 3.11 kernel, and it happens as far back as
RHEL6 (2.6.32 based); however, it is not a problem on RHEL5 (2.6.18
based). (All x86_64 systems.) I didn't try anything in between.
I've created a self-contained program below that will demonstrate
the problem (call it "t1"). Assuming /mnt has an NFS file system:
$ t1 /mnt/mynfsfile 1 # Fork 1 writer, kernel behaves normally
$ t1 /mnt/mynfsfile 2 # Fork 2 writers, kernel goes crazy WRITEing
Just run "watch -d nfsstat" in another window while running the two
writer test and watch the WRITE count explode.
I don't see anything particularly wrong with what the example code
is doing with its use of mmap. Is there anything undefined about
the code that would explain this behavior, or is this an NFS bug
that's really lived this long?
Quentin
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
int
kill_children()
{
int cnt = 0;
siginfo_t infop;
signal(SIGINT, SIG_IGN);
kill(0, SIGINT);
while (waitid(P_ALL, 0, &infop, WEXITED) != -1) ++cnt;
return cnt;
}
void
sighandler(int sig)
{
printf("Cleaning up all children.\n");
int cnt = kill_children();
printf("Cleaned up %d child%s.\n", cnt, cnt == 1 ? "" : "ren");
exit(0);
}
int
do_child(volatile int *iaddr)
{
while (1) *iaddr = 1;
}
int
main(int argc, char **argv)
{
const char *path;
int fd;
ssize_t wlen;
int *ip;
int fork_count = 1;
if (argc == 1) {
fprintf(stderr, "Usage: %s <path> [fork_count]\n",
argv[0]);
return 1;
}
path = argv[1];
if (argc > 2) {
int fc = atoi(argv[2]);
if (fc >= 0)
fork_count = fc;
}
fd = open(path, O_CREAT|O_TRUNC|O_RDWR|O_APPEND, S_IRUSR|S_IWUSR);
if (fd < 0) {
fprintf(stderr, "Open of '%s' failed: %s (%d)\n",
path, strerror(errno), errno);
return 1;
}
wlen = write(fd, &(int){0}, sizeof(int));
if (wlen != sizeof(int)) {
if (wlen < 0)
fprintf(stderr, "Write of '%s' failed: %s (%d)\n",
path, strerror(errno), errno);
else
fprintf(stderr, "Short write to '%s'\n", path);
return 1;
}
ip = (int *)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE,
MAP_SHARED, fd, 0);
if (ip == MAP_FAILED) {
fprintf(stderr, "Mmap of '%s' failed: %s (%d)\n",
path, strerror(errno), errno);
return 1;
}
signal(SIGINT, sighandler);
while (fork_count-- > 0) {
switch(fork()) {
case -1:
fprintf(stderr, "Fork failed: %s (%d)\n",
strerror(errno), errno);
kill_children();
return 1;
case 0: /* child */
signal(SIGINT, SIG_DFL);
do_child(ip);
break;
default: /* parent */
break;
}
}
printf("Press ^C to terminate test.\n");
pause();
return 0;
}
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 16:21 nfs-backed mmap file results in 1000s of WRITEs per second Quentin Barnes
@ 2013-09-05 17:03 ` Malahal Naineni
  2013-09-05 19:11   ` Quentin Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Malahal Naineni @ 2013-09-05 17:03 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs

Neil Brown posted a patch a couple of days ago for this!

http://thread.gmane.org/gmane.linux.nfs/58473

Regards, Malahal.

Quentin Barnes [qbarnes@gmail.com] wrote:
> If two (or more) processes are doing nothing more than writing to
> the memory addresses of an mmapped shared file on an NFS mounted
> file system, it results in the kernel scribbling WRITEs to the
> server as fast as it can (1000s per second) even while no syscalls
> are going on.
> [... rest of the original report and test program snipped ...]

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 17:03 ` Malahal Naineni
@ 2013-09-05 19:11   ` Quentin Barnes
  2013-09-05 20:02     ` Myklebust, Trond
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 19:11 UTC (permalink / raw)
To: linux-nfs

On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> Neil Brown posted a patch a couple of days ago for this!
>
> http://thread.gmane.org/gmane.linux.nfs/58473

I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
exhibited the same 1000s of WRITEs/sec problem.

Any other ideas?

> [... quoted original report and test program snipped ...]

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 19:11 ` Quentin Barnes
@ 2013-09-05 20:02   ` Myklebust, Trond
  2013-09-05 21:36     ` Quentin Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-05 20:02 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5213 bytes --]

On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > Neil Brown posted a patch a couple of days ago for this!
> >
> > http://thread.gmane.org/gmane.linux.nfs/58473
>
> I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> exhibited the same 1000s of WRITEs/sec problem.
>
> Any other ideas?

Yes. Please try the attached patch.

> [... quoted original report and test program snipped ...]

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-NFS-Don-t-check-lock-owner-compatibility-in-writes-u.patch --]
[-- Type: text/x-patch; name="0001-NFS-Don-t-check-lock-owner-compatibility-in-writes-u.patch", Size: 1257 bytes --]

From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Thu, 5 Sep 2013 15:52:51 -0400
Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless
 file is locked

If we're doing buffered writes, and there is no file locking involved,
then we don't have to worry about whether or not the lock owner
information is identical.
By relaxing this check, we ensure that fork()ed child processes can
write to a page without having to first sync dirty data that was
written by the parent to disk.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 40979e8..ac1dc33 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -863,7 +863,7 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
 		return 0;
 	l_ctx = req->wb_lock_context;
 	do_flush = req->wb_page != page || req->wb_context != ctx;
-	if (l_ctx) {
+	if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
 		do_flush |= l_ctx->lockowner.l_owner != current->files
 			|| l_ctx->lockowner.l_pid != current->tgid;
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 20:02 ` Myklebust, Trond
@ 2013-09-05 21:36   ` Quentin Barnes
  2013-09-05 21:57     ` Myklebust, Trond
  2013-09-05 22:07     ` Myklebust, Trond
  0 siblings, 2 replies; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 21:36 UTC (permalink / raw)
To: Myklebust, Trond; +Cc: linux-nfs@vger.kernel.org

On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > exhibited the same 1000s of WRITEs/sec problem.
> >
> > Any other ideas?
>
> Yes. Please try the attached patch.

Great!  That did the trick!

Do you feel this patch could be worthy of pushing it upstream in its
current state or was it just to verify a theory?

In comparing the nfs_flush_incompatible() implementations between
RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
more or less logically equivalent to me on whether or not to flush
the page.  Also, when and where nfs_flush_incompatible() is invoked
seems the same.  Would you provide a very brief pointer to clue me
in as to why this problem didn't also manifest circa 2.6.18 days?

> [... quoted earlier messages and patch snipped ...]

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 21:36 ` Quentin Barnes
@ 2013-09-05 21:57   ` Myklebust, Trond
  2013-09-05 22:34     ` Quentin Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-05 21:57 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs@vger.kernel.org

On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> In comparing the nfs_flush_incompatible() implementations between
> RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> more or less logically equivalent to me on whether or not to flush
> the page.  Also, when and where nfs_flush_incompatible() is invoked
> seems the same.  Would you provide a very brief pointer to clue me
> in as to why this problem didn't also manifest circa 2.6.18 days?

There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
days, and so the risk was that your mmapped writes could end up being
sent with the wrong credentials.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 21:57 ` Myklebust, Trond
@ 2013-09-05 22:34   ` Quentin Barnes
  2013-09-06 13:36     ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-05 22:34 UTC (permalink / raw)
To: Myklebust, Trond; +Cc: linux-nfs@vger.kernel.org

On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > Would you provide a very brief pointer to clue me
> > in as to why this problem didn't also manifest circa 2.6.18 days?
>
> There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> days, and so the risk was that your mmapped writes could end up being
> sent with the wrong credentials.

Ah!  You're right that nfs_vm_page_mkwrite() was missing from
the original 2.6.18, so that makes sense; however, Red Hat had
backported that function starting with their RHEL5.9(*) kernels,
yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
somewhere in RHEL5.9's do_wp_page(), or up that call path, but
glancing through it, it all looks pretty close though.

(*) That was the source I was using when comparing with the 3.11
source when studying your patch since it was the last kernel known
to me without the problem.

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 22:34 ` Quentin Barnes
@ 2013-09-06 13:36   ` Jeff Layton
  2013-09-06 15:00     ` Myklebust, Trond
  2013-09-06 16:48     ` Quentin Barnes
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2013-09-06 13:36 UTC (permalink / raw)
To: Quentin Barnes; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org

On Thu, 5 Sep 2013 17:34:20 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> the original 2.6.18, so that makes sense; however, Red Hat had
> backported that function starting with their RHEL5.9(*) kernels,
> yet the problem doesn't manifest on RHEL5.9.
> [... rest of quoted message snipped ...]

I'm pretty sure RHEL5 has a similar problem, but it's unclear to me
why you're not seeing it there.  I have a RHBZ open vs. RHEL5 but
it's marked private at the moment (I'll see about opening it up).

I brought this up upstream about a year ago with this strawman patch:

    http://article.gmane.org/gmane.linux.nfs/51240

...at the time Trond said he was working on a set of patches to track
the open/lock stateid on a per-req basis.  Did that approach not pan
out?

Also, do you need to do a similar fix to nfs_can_coalesce_requests?

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 13:36 ` Jeff Layton @ 2013-09-06 15:00 ` Myklebust, Trond 2013-09-06 15:04 ` Jeff Layton 2013-09-06 16:48 ` Quentin Barnes 1 sibling, 1 reply; 20+ messages in thread From: Myklebust, Trond @ 2013-09-06 15:00 UTC (permalink / raw) To: Jeff Layton; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote: > On Thu, 5 Sep 2013 17:34:20 -0500 > Quentin Barnes <qbarnes@gmail.com> wrote: > > > On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > > > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > > > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > > > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > > > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > > > > > > > Neil Brown posted a patch couple days ago for this! > > > > > > > > > > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > > > > > > > > > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > > > > > > exhibited the same 1000s of WRITEs/sec problem. > > > > > > > > > > > > Any other ideas? > > > > > > > > > > Yes. Please try the attached patch. > > > > > > > > Great! That did the trick! > > > > > > > > Do you feel this patch could be worthy of pushing it upstream in its > > > > current state or was it just to verify a theory? > > > > > > > > In comparing the nfs_flush_incompatible() implementations between > > > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > > > > more or less logically equivalent to me on whether or not to flush > > > > the page. Also, when and where nfs_flush_incompatible() is invoked > > > > seems the same. Would you provide a very brief pointer to clue me > > > > in as to why this problem didn't also manifest circa 2.6.18 days? > > > > > > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > > > days, and so the risk was that your mmapped writes could end up being > > > sent with the wrong credentials. > > > > Ah! You're right that nfs_vm_page_mkwrite() was missing from > > the original 2.6.18, so that makes sense, however, Red Hat had > > backported that function starting with their RHEL5.9(*) kernels, > > yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > > somewhere in RHEL5.9's do_wp_page(), or up that call path, but > > glancing through it, it all looks pretty close though. > > > > (*) That was the source I using when comparing with the 3.11 source > > when studying your patch since it was the last kernel known to me > > without the problem. > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > private at the moment (I'll see about opening it up). I brought this up > upstream about a year ago with this strawman patch: > > http://article.gmane.org/gmane.linux.nfs/51240 > > ...at the time Trond said he was working on a set of patches to track > the open/lock stateid on a per-req basis. Did that approach not pan > out? We've achieved what we wanted to do (Neil's lock recovery patch) without that machinery, so for now, we're dropping that. > Also, do you need to do a similar fix to nfs_can_coalesce_requests? Yes. Good point! -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 15:00 ` Myklebust, Trond @ 2013-09-06 15:04 ` Jeff Layton 2013-09-06 15:39 ` Myklebust, Trond 0 siblings, 1 reply; 20+ messages in thread From: Jeff Layton @ 2013-09-06 15:04 UTC (permalink / raw) To: Myklebust, Trond; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org On Fri, 6 Sep 2013 15:00:56 +0000 "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote: > On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote: > > On Thu, 5 Sep 2013 17:34:20 -0500 > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > > > On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > > > > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > > > > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > > > > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > > > > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > > > > > > > > Neil Brown posted a patch couple days ago for this! > > > > > > > > > > > > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > > > > > > > > > > > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > > > > > > > exhibited the same 1000s of WRITEs/sec problem. > > > > > > > > > > > > > > Any other ideas? > > > > > > > > > > > > Yes. Please try the attached patch. > > > > > > > > > > Great! That did the trick! > > > > > > > > > > Do you feel this patch could be worthy of pushing it upstream in its > > > > > current state or was it just to verify a theory? > > > > > > > > > > > > > > > In comparing the nfs_flush_incompatible() implementations between > > > > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > > > > > more or less logically equivalent to me on whether or not to flush > > > > > the page. Also, when and where nfs_flush_incompatible() is invoked > > > > > seems the same. 
Would you provide a very brief pointer to clue me > > > > > in as to why this problem didn't also manifest circa 2.6.18 days? > > > > > > > > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > > > > days, and so the risk was that your mmapped writes could end up being > > > > sent with the wrong credentials. > > > > > > Ah! You're right that nfs_vm_page_mkwrite() was missing from > > > the original 2.6.18, so that makes sense, however, Red Hat had > > > backported that function starting with their RHEL5.9(*) kernels, > > > yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > > > somewhere in RHEL5.9's do_wp_page(), or up that call path, but > > > glancing through it, it all looks pretty close though. > > > > > > > > > (*) That was the source I using when comparing with the 3.11 source > > > when studying your patch since it was the last kernel known to me > > > without the problem. > > > > > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > > private at the moment (I'll see about opening it up). I brought this up > > upstream about a year ago with this strawman patch: > > > > http://article.gmane.org/gmane.linux.nfs/51240 > > > > ...at the time Trond said he was working on a set of patches to track > > the open/lock stateid on a per-req basis. Did that approach not pan > > out? > > We've achieved what we wanted to do (Neil's lock recovery patch) without > that machinery, so for now, we're dropping that. > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Yes. Good point! > Cool. FWIW, here's the original bug that was opened against RHEL5: https://bugzilla.redhat.com/show_bug.cgi?id=736578 ...the reproducer that Max cooked up is not doing mmapped I/O so there may be a difference there, but I haven't looked closely at why that is. 
-- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 15:04 ` Jeff Layton @ 2013-09-06 15:39 ` Myklebust, Trond 2013-09-08 14:25 ` William Dauchy 0 siblings, 1 reply; 20+ messages in thread From: Myklebust, Trond @ 2013-09-06 15:39 UTC (permalink / raw) To: Jeff Layton; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 972 bytes --] On Fri, 2013-09-06 at 11:04 -0400, Jeff Layton wrote: > On Fri, 6 Sep 2013 15:00:56 +0000 > "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote: > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > > > Yes. Good point! > > > > Cool. FWIW, here's the original bug that was opened against RHEL5: > > https://bugzilla.redhat.com/show_bug.cgi?id=736578 > > ...the reproducer that Max cooked up is not doing mmapped I/O so there > may be a difference there, but I haven't looked closely at why that is. Here is the patch... There should be no big differences between mmapped I/O and ordinary I/O now that we have page_mkwrite() to set up the request. The only difference that I can think of offhand is when you use locking: for mmap() writes, NFS can only guarantee data integrity if you lock the entire page. 
-- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-NFS-Don-t-check-lock-owner-compatability-unless-file.patch --] [-- Type: text/x-patch; name="0001-NFS-Don-t-check-lock-owner-compatability-unless-file.patch", Size: 2262 bytes --] From 4109bb7496640aa97a12904527ba8e3a19b7ce7a Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Fri, 6 Sep 2013 11:09:38 -0400 Subject: [PATCH] NFS: Don't check lock owner compatability unless file is locked (part 2) When coalescing requests into a single READ or WRITE RPC call, and there is no file locking involved, we don't have to refuse coalescing for requests where the lock owner information doesn't match. Reported-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/pagelist.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c index 29cfb7a..2ffebf2 100644 --- a/fs/nfs/pagelist.c +++ b/fs/nfs/pagelist.c @@ -328,6 +328,19 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc, } EXPORT_SYMBOL_GPL(nfs_pageio_init); +static bool nfs_match_open_context(const struct nfs_open_context *ctx1, + const struct nfs_open_context *ctx2) +{ + return ctx1->cred == ctx2->cred && ctx1->state == ctx2->state; +} + +static bool nfs_match_lock_context(const struct nfs_lock_context *l1, + const struct nfs_lock_context *l2) +{ + return l1->lockowner.l_owner == l2->lockowner.l_owner + && l1->lockowner.l_pid == l2->lockowner.l_pid; +} + /** * nfs_can_coalesce_requests - test two requests for compatibility * @prev: pointer to nfs_page @@ -343,13 +356,10 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev, struct nfs_page *req, struct nfs_pageio_descriptor *pgio) { - if (req->wb_context->cred != prev->wb_context->cred) - return false; - if 
(req->wb_lock_context->lockowner.l_owner != prev->wb_lock_context->lockowner.l_owner) - return false; - if (req->wb_lock_context->lockowner.l_pid != prev->wb_lock_context->lockowner.l_pid) + if (!nfs_match_open_context(req->wb_context, prev->wb_context)) return false; - if (req->wb_context->state != prev->wb_context->state) + if (req->wb_context->dentry->d_inode->i_flock != NULL && + !nfs_match_lock_context(req->wb_lock_context, prev->wb_lock_context)) return false; if (req->wb_pgbase != 0) return false; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 15:39 ` Myklebust, Trond @ 2013-09-08 14:25 ` William Dauchy 0 siblings, 0 replies; 20+ messages in thread From: William Dauchy @ 2013-09-08 14:25 UTC (permalink / raw) To: Myklebust, Trond; +Cc: Jeff Layton, Quentin Barnes, linux-nfs@vger.kernel.org On Fri, Sep 6, 2013 at 5:39 PM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > There should be no big differences between mmapped I/O and ordinary I/O > now that we have page_mkwrite() to set up the request. The only > difference that I can think of offhand is when you use locking: for > mmap() writes, NFS can only guarantee data integrity if you lock the > entire page. Tested the two patches on top of a 3.10.x. Looks ok so far. -- William ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 13:36 ` Jeff Layton 2013-09-06 15:00 ` Myklebust, Trond @ 2013-09-06 16:48 ` Quentin Barnes 2013-09-07 14:51 ` Jeff Layton 2013-09-09 13:04 ` Jeff Layton 1 sibling, 2 replies; 20+ messages in thread From: Quentin Barnes @ 2013-09-06 16:48 UTC (permalink / raw) To: Jeff Layton; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org Jeff, can you try out my test program in the base note on your RHEL5.9 or later RHEL5.x kernels? I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 kernel (latest released RHEL5.9) does not show the problem for me. Based on what you and Trond have said in this thread though, I'm really curious why it doesn't have the problem. On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > On Thu, 5 Sep 2013 17:34:20 -0500 > Quentin Barnes <qbarnes@gmail.com> wrote: > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: >> > > > > > Neil Brown posted a patch couple days ago for this! >> > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 >> > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still >> > > > > exhibited the same 1000s of WRITEs/sec problem. >> > > > > >> > > > > Any other ideas? >> > > > >> > > > Yes. Please try the attached patch. >> > > >> > > Great! That did the trick! >> > > >> > > Do you feel this patch could be worthy of pushing it upstream in its >> > > current state or was it just to verify a theory? 
>> > > >> > > >> > > In comparing the nfs_flush_incompatible() implementations between >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem >> > > more or less logically equivalent to me on whether or not to flush >> > > the page. Also, when and where nfs_flush_incompatible() is invoked >> > > seems the same. Would you provide a very brief pointer to clue me >> > > in as to why this problem didn't also manifest circa 2.6.18 days? >> > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 >> > days, and so the risk was that your mmapped writes could end up being >> > sent with the wrong credentials. >> >> Ah! You're right that nfs_vm_page_mkwrite() was missing from >> the original 2.6.18, so that makes sense, however, Red Hat had >> backported that function starting with their RHEL5.9(*) kernels, >> yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but >> glancing through it, it all looks pretty close though. >> >> >> (*) That was the source I using when comparing with the 3.11 source >> when studying your patch since it was the last kernel known to me >> without the problem. >> > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > private at the moment (I'll see about opening it up). I brought this up > upstream about a year ago with this strawman patch: > > http://article.gmane.org/gmane.linux.nfs/51240 > > ...at the time Trond said he was working on a set of patches to track > the open/lock stateid on a per-req basis. Did that approach not pan > out? > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Thanks, > -- > Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 16:48 ` Quentin Barnes @ 2013-09-07 14:51 ` Jeff Layton 2013-09-07 15:00 ` Myklebust, Trond 2013-09-09 13:04 ` Jeff Layton 1 sibling, 1 reply; 20+ messages in thread From: Jeff Layton @ 2013-09-07 14:51 UTC (permalink / raw) To: Quentin Barnes; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org On Fri, 6 Sep 2013 11:48:45 -0500 Quentin Barnes <qbarnes@gmail.com> wrote: > Jeff, can your try out my test program in the base note on your > RHEL5.9 or later RHEL5.x kernels? > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > kernel (latest released RHEL5.9) does not show the problem for me. > Based on what you and Trond have said in this thread though, I'm > really curious why it doesn't have the problem. > > On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > > On Thu, 5 Sep 2013 17:34:20 -0500 > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > >> > > > > > Neil Brown posted a patch couple days ago for this! > >> > > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > >> > > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > >> > > > > exhibited the same 1000s of WRITEs/sec problem. > >> > > > > > >> > > > > Any other ideas? > >> > > > > >> > > > Yes. Please try the attached patch. > >> > > > >> > > Great! That did the trick! > >> > > > >> > > Do you feel this patch could be worthy of pushing it upstream in its > >> > > current state or was it just to verify a theory? 
> >> > > > >> > > > >> > > In comparing the nfs_flush_incompatible() implementations between > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > >> > > more or less logically equivalent to me on whether or not to flush > >> > > the page. Also, when and where nfs_flush_incompatible() is invoked > >> > > seems the same. Would you provide a very brief pointer to clue me > >> > > in as to why this problem didn't also manifest circa 2.6.18 days? > >> > > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > >> > days, and so the risk was that your mmapped writes could end up being > >> > sent with the wrong credentials. > >> > >> Ah! You're right that nfs_vm_page_mkwrite() was missing from > >> the original 2.6.18, so that makes sense, however, Red Hat had > >> backported that function starting with their RHEL5.9(*) kernels, > >> yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but > >> glancing through it, it all looks pretty close though. > >> > >> > >> (*) That was the source I using when comparing with the 3.11 source > >> when studying your patch since it was the last kernel known to me > >> without the problem. > >> > > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > > private at the moment (I'll see about opening it up). I brought this up > > upstream about a year ago with this strawman patch: > > > > http://article.gmane.org/gmane.linux.nfs/51240 > > > > ...at the time Trond said he was working on a set of patches to track > > the open/lock stateid on a per-req basis. Did that approach not pan > > out? > > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Yes, I see the same behavior you do. With a recent kernel I see a ton of WRITE requests go out, with RHEL5 hardly any. 
I guess I'm a little confused as to the reverse question. Why are we seeing this data get flushed out so quickly in recent kernels from just changes to the mmaped pages? My understanding has always been that when a page is cleaned, we set the WP bit on it, and then when it goes dirty we clear it and also call page_mkwrite (not necessarily in that order). So here we have two processes that mmap the same page, and then are furiously writing to it. The kernel shouldn't really care or be aware of that thrashing until that page gets flushed out for some reason (msync() call or VM pressure). IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more recent kernels don't behave that way. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-07 14:51 ` Jeff Layton @ 2013-09-07 15:00 ` Myklebust, Trond 0 siblings, 0 replies; 20+ messages in thread From: Myklebust, Trond @ 2013-09-07 15:00 UTC (permalink / raw) To: Jeff Layton; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org On Sat, 2013-09-07 at 10:51 -0400, Jeff Layton wrote: > On Fri, 6 Sep 2013 11:48:45 -0500 > Quentin Barnes <qbarnes@gmail.com> wrote: > > > Jeff, can your try out my test program in the base note on your > > RHEL5.9 or later RHEL5.x kernels? > > > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > > kernel (latest released RHEL5.9) does not show the problem for me. > > Based on what you and Trond have said in this thread though, I'm > > really curious why it doesn't have the problem. > > > > On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > > > On Thu, 5 Sep 2013 17:34:20 -0500 > > > Quentin Barnes <qbarnes@gmail.com> wrote: > > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > > >> > > > > > Neil Brown posted a patch couple days ago for this! > > >> > > > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > > >> > > > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > > >> > > > > exhibited the same 1000s of WRITEs/sec problem. > > >> > > > > > > >> > > > > Any other ideas? > > >> > > > > > >> > > > Yes. Please try the attached patch. > > >> > > > > >> > > Great! That did the trick! > > >> > > > > >> > > Do you feel this patch could be worthy of pushing it upstream in its > > >> > > current state or was it just to verify a theory? > > >> > > > > >> > > In comparing the nfs_flush_incompatible() implementations between > > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > > >> > > more or less logically equivalent to me on whether or not to flush > > >> > > the page. Also, when and where nfs_flush_incompatible() is invoked > > >> > > seems the same. Would you provide a very brief pointer to clue me > > >> > > in as to why this problem didn't also manifest circa 2.6.18 days? > > >> > > > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > > >> > days, and so the risk was that your mmapped writes could end up being > > >> > sent with the wrong credentials. > > >> > > >> Ah! You're right that nfs_vm_page_mkwrite() was missing from > > >> the original 2.6.18, so that makes sense, however, Red Hat had > > >> backported that function starting with their RHEL5.9(*) kernels, > > >> yet the problem doesn't manifest on RHEL5.9. Maybe the answer lies > > >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but > > >> glancing through it, it all looks pretty close though. > > >> > > >> (*) That was the source I using when comparing with the 3.11 source > > >> when studying your patch since it was the last kernel known to me > > >> without the problem. > > > > > > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why > > > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked > > > private at the moment (I'll see about opening it up). I brought this up > > > upstream about a year ago with this strawman patch: > > > > > > http://article.gmane.org/gmane.linux.nfs/51240 > > > > > > ...at the time Trond said he was working on a set of patches to track > > > the open/lock stateid on a per-req basis. Did that approach not pan > > > out? > > > > > > Also, do you need to do a similar fix to nfs_can_coalesce_requests? > > Yes, I see the same behavior you do. With a recent kernel I see a ton > of WRITE requests go out, with RHEL5 hardly any. > > I guess I'm a little confused as to the reverse question. Why are we > seeing this data get flushed out so quickly in recent kernels from just > changes to the mmaped pages? > > My understanding has always been that when a page is cleaned, we set > the WP bit on it, and then when it goes dirty we clear it and also > call page_mkwrite (not necessarily in that order). > > So here we have two processes that mmap the same page, and then are > furiously writing to it. The kernel shouldn't really care or be aware > of that thrashing until that page gets flushed out for some reason > (msync() call or VM pressure). fork() is not supposed to share page tables between parent and child process. Shouldn't that also imply that the page write protect bits are not shared? IOW: a write protect page fault in the parent process that sets page_mkwrite() should not prevent a similar write protect page fault in the child process (and subsequent call to page_mkwrite()). ...or is my understanding of the page fault semantics wrong? > IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more > recent kernels don't behave that way. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second 2013-09-06 16:48 ` Quentin Barnes 2013-09-07 14:51 ` Jeff Layton @ 2013-09-09 13:04 ` Jeff Layton 2013-09-09 17:32 ` Quentin Barnes 1 sibling, 1 reply; 20+ messages in thread From: Jeff Layton @ 2013-09-09 13:04 UTC (permalink / raw) To: Quentin Barnes; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org On Fri, 6 Sep 2013 11:48:45 -0500 Quentin Barnes <qbarnes@gmail.com> wrote: > Jeff, can your try out my test program in the base note on your > RHEL5.9 or later RHEL5.x kernels? > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > kernel (latest released RHEL5.9) does not show the problem for me. > Based on what you and Trond have said in this thread though, I'm > really curious why it doesn't have the problem. > I can confirm what you see on RHEL5. One difference is that RHEL5's page_mkwrite handler does not do wait_on_page_writeback. That was added as part of the stable pages work that went in a while back, so that may be the main difference. Adding that in doesn't seem to materially change things though. In any case, what I see is that the initial program just ends up with two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then things settle down (likely because the page is still marked dirty). Eventually, another write occurs and the dirty page gets pushed out to the server in a small flurry of WRITEs to the same range. Then, things settle down again until there's another small flurry of activity. My suspicion is that there is a race condition involved here, but I'm unclear on where it is. I'm not 100% convinced this is a bug, but page fault semantics aren't my strong suit. You may want to consider opening a "formal" RH support case if you have interest in getting Trond's patch backported, and/or following up on why RHEL5 behaves the way it does. 
> On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote: > > On Thu, 5 Sep 2013 17:34:20 -0500 > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote: > >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote: > >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote: > >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote: > >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote: > >> > > > > > Neil Brown posted a patch couple days ago for this! > >> > > > > > > >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473 > >> > > > > > >> > > > > I tried Neil's patch on a v3.11 kernel. The rebuilt kernel still > >> > > > > exhibited the same 1000s of WRITEs/sec problem. > >> > > > > > >> > > > > Any other ideas? > >> > > > > >> > > > Yes. Please try the attached patch. > >> > > > >> > > Great! That did the trick! > >> > > > >> > > Do you feel this patch could be worthy of pushing it upstream in its > >> > > current state or was it just to verify a theory? > >> > > > >> > > > >> > > In comparing the nfs_flush_incompatible() implementations between > >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem > >> > > more or less logically equivalent to me on whether or not to flush > >> > > the page. Also, when and where nfs_flush_incompatible() is invoked > >> > > seems the same. Would you provide a very brief pointer to clue me > >> > > in as to why this problem didn't also manifest circa 2.6.18 days? > >> > > >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18 > >> > days, and so the risk was that your mmapped writes could end up being > >> > sent with the wrong credentials. > >> > >> Ah! 
> >> You're right that nfs_vm_page_mkwrite() was missing from
> >> the original 2.6.18, so that makes sense, however, Red Hat had
> >> backported that function starting with their RHEL5.9(*) kernels,
> >> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> >> glancing through it, it all looks pretty close though.
> >>
> >> (*) That was the source I was using when comparing with the 3.11 source
> >> when studying your patch since it was the last kernel known to me
> >> without the problem.
> >>
> >
> > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > private at the moment (I'll see about opening it up). I brought this up
> > upstream about a year ago with this strawman patch:
> >
> > http://article.gmane.org/gmane.linux.nfs/51240
> >
> > ...at the time Trond said he was working on a set of patches to track
> > the open/lock stateid on a per-req basis. Did that approach not pan
> > out?
> >
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> >
> > Thanks,
> > --
> > Jeff Layton <jlayton@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-09 13:04 ` Jeff Layton
@ 2013-09-09 17:32 ` Quentin Barnes
  2013-09-09 17:47 ` Myklebust, Trond
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Barnes @ 2013-09-09 17:32 UTC (permalink / raw)
To: Jeff Layton; +Cc: Myklebust, Trond, linux-nfs@vger.kernel.org

On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> On Fri, 6 Sep 2013 11:48:45 -0500
> Quentin Barnes <qbarnes@gmail.com> wrote:
>
> > Jeff, can you try out my test program in the base note on your
> > RHEL5.9 or later RHEL5.x kernels?
> >
> > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > kernel (latest released RHEL5.9) does not show the problem for me.
> > Based on what you and Trond have said in this thread though, I'm
> > really curious why it doesn't have the problem.
>
> I can confirm what you see on RHEL5. One difference is that RHEL5's
> page_mkwrite handler does not do wait_on_page_writeback. That was added
> as part of the stable pages work that went in a while back, so that may
> be the main difference. Adding that in doesn't seem to materially
> change things though.

Good to know you confirmed the behavior I saw on RHEL5 (just so that
I know it's not some random variable in play I had overlooked).

> In any case, what I see is that the initial program just ends up with
> two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> things settle down (likely because the page is still marked dirty).
>
> Eventually, another write occurs and the dirty page gets pushed out to
> the server in a small flurry of WRITEs to the same range. Then, things
> settle down again until there's another small flurry of activity.
>
> My suspicion is that there is a race condition involved here, but I'm
> unclear on where it is. I'm not 100% convinced this is a bug, but page
> fault semantics aren't my strong suit.
As a test on RHEL6, I made a trivial systemtap script for kprobing
nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
make sure this bug was limited to just the nfs module and was not a
result of some mm behavior change.

With the bug unfixed running the test program, nfs_vm_page_mkwrite()
and nfs_flush_incompatible() are called repeatedly at a very high rate
(hence all the WRITEs).

After Trond's patch, the two functions are called just at the
program's initialization and then called only every 30 seconds or
so.

It looks to me from the code flow that there must be something
nfs_wb_page() does that resets the need for mm to keep reinvoking
nfs_vm_page_mkwrite().  I didn't look any deeper than that though
for now.  Maybe a race in how nfs_wb_page() updates status you're
thinking of?

> You may want to consider opening a "formal" RH support case if you have
> interest in getting Trond's patch backported, and/or following up on
> why RHEL5 behaves the way it does.

Yes, I'll be doing that.  When I do, I'll send you an email with the
case ticket.  Before filing it though, I want to hear back from the
group that had the original problem to make sure Trond's patch fully
addresses their problem (besides just the trivial test program).

> --
> Jeff Layton <jlayton@redhat.com>

Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread
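[Editor's note: the systemtap script itself was not posted in the thread. A minimal sketch along the lines Quentin describes might look like the following; the probe targets come from his message, while the module name ("nfs") and the one-second reporting interval are assumptions.]

```systemtap
# Sketch: count per-second calls to the two probed functions.
# Run as root while the mmap test program is active:  stap nfs_mkwrite.stp
global mkwrite, flush

probe module("nfs").function("nfs_vm_page_mkwrite")   { mkwrite++ }
probe module("nfs").function("nfs_flush_incompatible") { flush++ }

probe timer.s(1) {
    printf("%6d nfs_vm_page_mkwrite/s  %6d nfs_flush_incompatible/s\n",
           mkwrite, flush)
    mkwrite = 0
    flush = 0
}
```

On an unpatched kernel this would be expected to show both counters spiking into the thousands while the two-writer test runs; with Trond's patch applied, only the occasional call as writeback kicks in.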
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-09 17:32 ` Quentin Barnes
@ 2013-09-09 17:47 ` Myklebust, Trond
  2013-09-09 18:21 ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-09 17:47 UTC (permalink / raw)
To: Quentin Barnes; +Cc: Jeff Layton, linux-nfs@vger.kernel.org

On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > On Fri, 6 Sep 2013 11:48:45 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> >
> > > Jeff, can you try out my test program in the base note on your
> > > RHEL5.9 or later RHEL5.x kernels?
> > >
> > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > Based on what you and Trond have said in this thread though, I'm
> > > really curious why it doesn't have the problem.
> >
> > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > as part of the stable pages work that went in a while back, so that may
> > be the main difference. Adding that in doesn't seem to materially
> > change things though.
>
> Good to know you confirmed the behavior I saw on RHEL5 (just so that
> I know it's not some random variable in play I had overlooked).
>
> > In any case, what I see is that the initial program just ends up with
> > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > things settle down (likely because the page is still marked dirty).
> >
> > Eventually, another write occurs and the dirty page gets pushed out to
> > the server in a small flurry of WRITEs to the same range. Then, things
> > settle down again until there's another small flurry of activity.
> >
> > My suspicion is that there is a race condition involved here, but I'm
> > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > fault semantics aren't my strong suit.
>
> As a test on RHEL6, I made a trivial systemtap script for kprobing
> nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> make sure this bug was limited to just the nfs module and was not a
> result of some mm behavior change.
>
> With the bug unfixed running the test program, nfs_vm_page_mkwrite()
> and nfs_flush_incompatible() are called repeatedly at a very high rate
> (hence all the WRITEs).
>
> After Trond's patch, the two functions are called just at the
> program's initialization and then called only every 30 seconds or
> so.
>
> It looks to me from the code flow that there must be something
> nfs_wb_page() does that resets the need for mm to keep reinvoking
> nfs_vm_page_mkwrite().  I didn't look any deeper than that though
> for now.  Maybe a race in how nfs_wb_page() updates status you're
> thinking of?

In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(),
which does _not_ call clear_page_dirty_for_io() (and hence does not call
page_mkclean()).

That would explain it...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-09 17:47 ` Myklebust, Trond
@ 2013-09-09 18:21 ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2013-09-09 18:21 UTC (permalink / raw)
To: Myklebust, Trond; +Cc: Quentin Barnes, linux-nfs@vger.kernel.org

On Mon, 9 Sep 2013 17:47:48 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> > On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > > On Fri, 6 Sep 2013 11:48:45 -0500
> > > Quentin Barnes <qbarnes@gmail.com> wrote:
> > >
> > > > Jeff, can you try out my test program in the base note on your
> > > > RHEL5.9 or later RHEL5.x kernels?
> > > >
> > > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > > Based on what you and Trond have said in this thread though, I'm
> > > > really curious why it doesn't have the problem.
> > >
> > > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > > as part of the stable pages work that went in a while back, so that may
> > > be the main difference. Adding that in doesn't seem to materially
> > > change things though.
> >
> > Good to know you confirmed the behavior I saw on RHEL5 (just so that
> > I know it's not some random variable in play I had overlooked).
> >
> > > In any case, what I see is that the initial program just ends up with
> > > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > > things settle down (likely because the page is still marked dirty).
> > >
> > > Eventually, another write occurs and the dirty page gets pushed out to
> > > the server in a small flurry of WRITEs to the same range. Then, things
> > > settle down again until there's another small flurry of activity.
> > >
> > > My suspicion is that there is a race condition involved here, but I'm
> > > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > > fault semantics aren't my strong suit.
> >
> > As a test on RHEL6, I made a trivial systemtap script for kprobing
> > nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> > make sure this bug was limited to just the nfs module and was not a
> > result of some mm behavior change.
> >
> > With the bug unfixed running the test program, nfs_vm_page_mkwrite()
> > and nfs_flush_incompatible() are called repeatedly at a very high rate
> > (hence all the WRITEs).
> >
> > After Trond's patch, the two functions are called just at the
> > program's initialization and then called only every 30 seconds or
> > so.
> >
> > It looks to me from the code flow that there must be something
> > nfs_wb_page() does that resets the need for mm to keep reinvoking
> > nfs_vm_page_mkwrite().  I didn't look any deeper than that though
> > for now.  Maybe a race in how nfs_wb_page() updates status you're
> > thinking of?
>
> In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(),
> which does _not_ call clear_page_dirty_for_io() (and hence does not call
> page_mkclean()).
>
> That would explain it...

Thanks Trond, that does explain it. FWIW, at this point in the RHEL5
lifecycle I'd be disinclined to make any changes to that code without
some strong justification. Backporting Trond's recent patch for RHEL6
and making sure that RHEL7 has it sounds quite reasonable though.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: nfs-backed mmap file results in 1000s of WRITEs per second
  2013-09-05 21:36 ` Quentin Barnes
  2013-09-05 21:57 ` Myklebust, Trond
@ 2013-09-05 22:07 ` Myklebust, Trond
  1 sibling, 0 replies; 20+ messages in thread
From: Myklebust, Trond @ 2013-09-05 22:07 UTC (permalink / raw)
To: Quentin Barnes; +Cc: linux-nfs@vger.kernel.org

On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > Neil Brown posted a patch couple days ago for this!
> > > >
> > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > >
> > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > exhibited the same 1000s of WRITEs/sec problem.
> > >
> > > Any other ideas?
> >
> > Yes. Please try the attached patch.
>
> Great!  That did the trick!
>
> Do you feel this patch could be worthy of pushing it upstream in its
> current state or was it just to verify a theory?

It should be safe to merge this. It is a valid optimisation in the case
where there are no locks applied.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2013-09-09 18:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-05 16:21 nfs-backed mmap file results in 1000s of WRITEs per second Quentin Barnes
2013-09-05 17:03 ` Malahal Naineni
2013-09-05 19:11 ` Quentin Barnes
2013-09-05 20:02 ` Myklebust, Trond
2013-09-05 21:36 ` Quentin Barnes
2013-09-05 21:57 ` Myklebust, Trond
2013-09-05 22:34 ` Quentin Barnes
2013-09-06 13:36 ` Jeff Layton
2013-09-06 15:00 ` Myklebust, Trond
2013-09-06 15:04 ` Jeff Layton
2013-09-06 15:39 ` Myklebust, Trond
2013-09-08 14:25 ` William Dauchy
2013-09-06 16:48 ` Quentin Barnes
2013-09-07 14:51 ` Jeff Layton
2013-09-07 15:00 ` Myklebust, Trond
2013-09-09 13:04 ` Jeff Layton
2013-09-09 17:32 ` Quentin Barnes
2013-09-09 17:47 ` Myklebust, Trond
2013-09-09 18:21 ` Jeff Layton
2013-09-05 22:07 ` Myklebust, Trond