Linux NFS development
* burst mount of NFS over tcp
@ 2007-05-15 15:17 Flavio Leitner
  2007-05-15 19:23 ` Trond Myklebust
  2007-05-16  2:32 ` Ian Kent
  0 siblings, 2 replies; 5+ messages in thread
From: Flavio Leitner @ 2007-05-15 15:17 UTC
  To: nfs; +Cc: flavio.leitner

(Please keep me on CC, as I'm not subscribed.)

Hi,

When many (>500) mounts run at the same time with -o tcp, only about
300-400 of them succeed, because the client runs out of privileged ports
(they are busy in TIME_WAIT state).

The privileged port range available for this (~512 ports) can't be
enlarged, since that would cause more port conflicts.

One option would be to reuse the ports in TIME_WAIT state by opening the
socket with the SO_REUSEADDR socket option. However, a connection is
identified by the tuple
(local addr, local port, remote addr, remote port, proto), and that tuple
must differ from the one still in TIME_WAIT. When you mount from the same
NFS server the tuples are always identical, so this can't work.
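
For reference, a minimal sketch of what setting SO_REUSEADDR before
binding looks like. The helper name bind_reuseaddr() is made up for
illustration, and it binds a loopback address so that it can be tried
without root (pass 0 to let the kernel pick a port). It only shows the
socket calls; it does nothing about the tuple collision described above.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, set SO_REUSEADDR and bind it
 * to the given local port (host byte order; 0 lets the kernel choose).
 * Returns the socket descriptor, or -1 with errno set. */
int bind_reuseaddr(unsigned short port)
{
	struct sockaddr_in laddr;
	int one = 1;
	int saved;
	int so = socket(AF_INET, SOCK_STREAM, 0);

	if (so < 0)
		return -1;

	/* The option has to be set before bind() to influence it. */
	if (setsockopt(so, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
		goto fail;

	memset(&laddr, 0, sizeof(laddr));
	laddr.sin_family = AF_INET;
	laddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	laddr.sin_port = htons(port);
	if (bind(so, (struct sockaddr *)&laddr, sizeof(laddr)) < 0)
		goto fail;

	return so;

fail:
	saved = errno;		/* close() may overwrite errno */
	close(so);
	errno = saved;
	return -1;
}
```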

I think shortening TIME_WAIT in the kernel by enabling fast recycling
would affect other TCP connections and have some undesirable side
effects.

The solution I'm proposing is to add two new NFS mount options that
specify a number of retries and a timeout between them before the mount
fails, e.g.

# mount -o tcp,rsvretry=10,rsvtimeout=5 server:/export /mnt

This makes mount.nfs try to bind a reserved port up to 10 times, waiting
5 seconds between attempts, before giving up. If the options are not
used, the default behaviour is unchanged (i.e. it tries only once and
then fails).

The code should look something like this:
+             int bind_retry = 0;   /* set from rsvretry= */
...
-               if (bindresvport(so, &laddr) < 0) {
+              while (bindresvport(so, &laddr) < 0) {
+                       if (errno == EADDRINUSE && --bind_retry > 0) {
+                               sleep(bind_timeout);  /* set from rsvtimeout= */
+                               continue;
+                       }
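
Filled out as a self-contained sketch, the retry loop could look like the
following. The names bind_with_retry, bind_fn, fake_bind and
busy_attempts are hypothetical and introduced only for this example. The
bind step is passed in as a function pointer purely so the loop can be
exercised without root; in mount.nfs it would be the bindresvport() call.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical signature for the bind step. */
typedef int (*bind_fn)(int sock);

/* Try to bind up to 'retries' times, sleeping 'delay' seconds between
 * attempts, but only while the failure is EADDRINUSE. A 'retries' value
 * of 0 or 1 means a single attempt, matching the unchanged default.
 * Returns 0 on success, -1 with errno left as set by the bind step. */
int bind_with_retry(bind_fn do_bind, int sock, int retries, unsigned int delay)
{
	while (do_bind(sock) < 0) {
		if (errno == EADDRINUSE && --retries > 0) {
			sleep(delay);
			continue;
		}
		return -1;
	}
	return 0;
}

/* Stand-in for bindresvport(), for illustration only: it fails with
 * EADDRINUSE until busy_attempts has counted down to zero. */
int busy_attempts;

int fake_bind(int sock)
{
	(void)sock;
	if (busy_attempts > 0) {
		busy_attempts--;
		errno = EADDRINUSE;
		return -1;
	}
	return 0;
}
```

With busy_attempts set to 3, bind_with_retry(fake_bind, -1, 10, 0)
succeeds on the fourth attempt. With more busy attempts than retries it
gives up and leaves errno as EADDRINUSE.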

A real-world scenario could be a huge fstab, or an MDA delivering e-mail
for 500 users with automounted home directories.

With this change I was able to mount more than 1000 NFS filesystems
successfully in the second scenario, where normally I only manage 200-400.

What do you think?

-- 
Flavio Leitner

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* Re: burst mount of NFS over tcp
  2007-05-15 15:17 burst mount of NFS over tcp Flavio Leitner
@ 2007-05-15 19:23 ` Trond Myklebust
  2007-05-17  2:25   ` Flavio Leitner
  2007-06-13  1:15   ` Flavio Leitner
  2007-05-16  2:32 ` Ian Kent
  1 sibling, 2 replies; 5+ messages in thread
From: Trond Myklebust @ 2007-05-15 19:23 UTC
  To: Flavio Leitner; +Cc: nfs

On Tue, 2007-05-15 at 12:17 -0300, Flavio Leitner wrote:
> The solution I'm proposing is to add two new NFS mount options that
> specify a number of retries and a timeout between them before the mount
> fails, e.g.
>
> # mount -o tcp,rsvretry=10,rsvtimeout=5 server:/export /mnt
>
> This makes mount.nfs try to bind a reserved port up to 10 times, waiting
> 5 seconds between attempts, before giving up. If the options are not
> used, the default behaviour is unchanged (i.e. it tries only once and
> then fails).

Why not fix the 'retry' mount option to deal properly with this case? I
can't see why running out of reserved ports really needs its own mount
options.

Trond



* Re: burst mount of NFS over tcp
  2007-05-15 15:17 burst mount of NFS over tcp Flavio Leitner
  2007-05-15 19:23 ` Trond Myklebust
@ 2007-05-16  2:32 ` Ian Kent
  1 sibling, 0 replies; 5+ messages in thread
From: Ian Kent @ 2007-05-16  2:32 UTC
  To: Flavio Leitner; +Cc: nfs

On Tue, 2007-05-15 at 12:17 -0300, Flavio Leitner wrote:
> (Please keep me on CC, as I'm not subscribed.)
>
> Hi,
>
> When many (>500) mounts run at the same time with -o tcp, only about
> 300-400 of them succeed, because the client runs out of privileged ports
> (they are busy in TIME_WAIT state).
>
> The privileged port range available for this (~512 ports) can't be
> enlarged, since that would cause more port conflicts.
>
> One option would be to reuse the ports in TIME_WAIT state by opening the
> socket with the SO_REUSEADDR socket option. However, a connection is
> identified by the tuple
> (local addr, local port, remote addr, remote port, proto), and that tuple
> must differ from the one still in TIME_WAIT. When you mount from the same
> NFS server the tuples are always identical, so this can't work.

This has been discussed before.
SO_REUSEADDR is used on server-side listening sockets and, as far as
I can tell, can't be used on the client side.

But I'm open to being educated.

> 
> I think shortening TIME_WAIT in the kernel by enabling fast recycling
> would affect other TCP connections and have some undesirable side
> effects.

Exactly. Ideally there would be a socket pooling mechanism under glibc
RPC that would allow connected sockets to be reused, but I can't think
of a way to do this since file descriptors are per-process. And adding
non-standard syscalls is unlikely to be well accepted.

Any ideas?

Ian




* Re: burst mount of NFS over tcp
  2007-05-15 19:23 ` Trond Myklebust
@ 2007-05-17  2:25   ` Flavio Leitner
  2007-06-13  1:15   ` Flavio Leitner
  1 sibling, 0 replies; 5+ messages in thread
From: Flavio Leitner @ 2007-05-17  2:25 UTC
  To: Trond Myklebust; +Cc: nfs

[-- Attachment #1: Type: text/plain, Size: 2494 bytes --]

On 5/15/07, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> Why not fix the 'retry' mount option to deal properly with this case? I
> can't see why running out of reserved ports really needs its own mount
> options.

Sure, you are right. What about the attached one?

Thanks!

(In case my mailer mangles the patch, I've left a copy at
http://sysclose.org/nfs-utils/nfs-utils-1.0.12-bindrsvp-retry.patch )

-- 
Flavio Leitner

[-- Attachment #2: nfs-utils-1.0.12-bindrsvp-retry.patch --]
[-- Type: application/octet-stream, Size: 1709 bytes --]


[mount] fix retry= to handle lack of reserved port situation

In the case of several (>500) mounts running at the same time
with -o tcp, the number of attempts that succeed is about 300-500
because the client runs out of privileged ports (they are busy in
TIME_WAIT state).

This patch fixes the retry= mount option to handle this situation.

Signed-off-by: Flavio Leitner <flavio.leitner@gmail.com>

Index: nfs-utils-1.0.12/utils/mount/nfsmount.c
===================================================================
--- nfs-utils-1.0.12.orig/utils/mount/nfsmount.c
+++ nfs-utils-1.0.12/utils/mount/nfsmount.c
@@ -1130,9 +1130,13 @@ noauth_flavors:
 		perror(_("nfs socket"));
 		goto fail;
 	}
-	if (bindresvport(fsock, 0) < 0) {
-		perror(_("nfs bindresvport"));
-		goto fail;
+	while (bindresvport(fsock, 0) < 0) {
+		t = time(NULL);
+		if (t > timeout || errno != EADDRINUSE) {
+			perror(_("nfs bindresvport"));
+			goto fail;
+		}
+		sleep(10);
 	}
 #ifdef NFS_MOUNT_DEBUG
 	printf(_("using port %d for nfs deamon\n"), nfs_pmap->pm_port);
Index: nfs-utils-1.0.12/support/nfs/conn.c
===================================================================
--- nfs-utils-1.0.12.orig/support/nfs/conn.c
+++ nfs-utils-1.0.12/support/nfs/conn.c
@@ -192,6 +192,12 @@ CLIENT *mnt_openclnt(clnt_addr_t *mnt_se
 	/* contact the mount daemon via TCP */
 	mnt_saddr->sin_port = htons((u_short)mnt_pmap->pm_port);
 	*msock = get_socket(mnt_saddr, mnt_pmap->pm_prot, TRUE);
+	if (*msock == RPC_ANYSOCK &&
+		rpc_createerr.cf_error.re_errno == EADDRINUSE) {
+		/* Bubble up the error to see how it should be handled. */
+		rpc_createerr.cf_stat = RPC_TIMEDOUT;
+		return NULL;
+	}
 
 	switch (mnt_pmap->pm_prot) {
 	case IPPROTO_UDP:
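
The nfsmount.c hunk above retries against a deadline rather than a fixed
count. As a standalone illustration of that pattern only (this is not
the nfs-utils code), here is a minimal sketch. The names retry_until,
try_fn and fail_n_times are hypothetical, and the pause is a parameter
here whereas the patch hard-codes sleep(10).

```c
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*try_fn)(void *ctx);

/* Call 'attempt' until it succeeds. Give up when it fails with anything
 * other than EADDRINUSE, or when the absolute 'deadline' has passed.
 * Returns 0 on success, -1 with errno from the last failed attempt. */
int retry_until(try_fn attempt, void *ctx, time_t deadline, unsigned int pause)
{
	while (attempt(ctx) < 0) {
		int saved = errno;	/* keep it safe across time()/sleep() */

		if (saved != EADDRINUSE || time(NULL) > deadline) {
			errno = saved;
			return -1;
		}
		sleep(pause);
	}
	return 0;
}

/* Stand-in for the bind step, for illustration only: 'ctx' points to an
 * int counting how many more times to fail with EADDRINUSE. */
int fail_n_times(void *ctx)
{
	int *left = ctx;

	if (*left > 0) {
		(*left)--;
		errno = EADDRINUSE;
		return -1;
	}
	return 0;
}
```

Saving errno before calling time() is a defensive choice in this sketch,
so that the value reported on failure is the one from the failed attempt.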


* Re: burst mount of NFS over tcp
  2007-05-15 19:23 ` Trond Myklebust
  2007-05-17  2:25   ` Flavio Leitner
@ 2007-06-13  1:15   ` Flavio Leitner
  1 sibling, 0 replies; 5+ messages in thread
From: Flavio Leitner @ 2007-06-13  1:15 UTC
  To: Trond Myklebust; +Cc: nfs

[-- Attachment #1: Type: text/plain, Size: 2414 bytes --]


Resending with an updated patch (based on git).

On 5/15/07, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> Why not fix the 'retry' mount option to deal properly with this case? I
> can't see why running out of reserved ports really needs its own mount
> options.

Sure, you are right. What about the attached one?

Thanks!

-- 
Flavio Leitner

[-- Attachment #2: nfs-utils-bindrsvp-retry.patch --]
[-- Type: text/plain, Size: 1595 bytes --]

[PATCH] Fix retry= to handle lack of reserved port situation

In the case of several (>500) mounts running at the same time
with -o tcp, the number of attempts that succeed is about 300-500
because the client runs out of privileged ports (they are busy in
TIME_WAIT state).

Signed-off-by: Flavio Leitner <flavio.leitner@gmail.com>
---
 support/nfs/conn.c     |    6 ++++++
 utils/mount/nfsmount.c |   10 +++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/support/nfs/conn.c b/support/nfs/conn.c
index 29dbb82..315d766 100644
--- a/support/nfs/conn.c
+++ b/support/nfs/conn.c
@@ -200,6 +200,12 @@ CLIENT *mnt_openclnt(clnt_addr_t *mnt_se
 	/* contact the mount daemon via TCP */
 	mnt_saddr->sin_port = htons((u_short)mnt_pmap->pm_port);
 	*msock = get_socket(mnt_saddr, mnt_pmap->pm_prot, TRUE, FALSE);
+	if (*msock == RPC_ANYSOCK &&
+		rpc_createerr.cf_error.re_errno == EADDRINUSE) {
+		/* Bubble up the error to see how it should be handled. */
+		rpc_createerr.cf_stat = RPC_TIMEDOUT;
+		return NULL;
+	}
 
 	switch (mnt_pmap->pm_prot) {
 	case IPPROTO_UDP:
diff --git a/utils/mount/nfsmount.c b/utils/mount/nfsmount.c
index 815064a..a2ae520 100644
--- a/utils/mount/nfsmount.c
+++ b/utils/mount/nfsmount.c
@@ -1178,9 +1178,13 @@ #endif
 			perror(_("nfs socket"));
 			goto fail;
 		}
-		if (bindresvport(fsock, 0) < 0) {
-			perror(_("nfs bindresvport"));
-			goto fail;
+		while (bindresvport(fsock, 0) < 0) {
+			t = time(NULL);
+			if (t > timeout || errno != EADDRINUSE) {
+				perror(_("nfs bindresvport"));
+				goto fail;
+			}
+			sleep(10);
 		}
 	}
 
-- 
1.4.1
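
The conn.c hunk above turns one specific failure (EADDRINUSE while
getting the socket) into a status that, per the patch comment, is passed
up so the caller can decide how to handle it. A minimal sketch of that
classification idea follows. The enum and the function name are made up
for illustration and are not the real RPC status codes.

```c
#include <errno.h>

/* Hypothetical result codes, standing in for the RPC status values. */
enum open_status {
	OPEN_OK,	/* a usable socket was obtained */
	OPEN_RETRY,	/* transient: no free reserved port right now */
	OPEN_FATAL	/* anything else: report and give up */
};

/* Classify the outcome of a socket setup attempt. 'sock' is the
 * descriptor (negative on failure) and 'err' is the errno value saved
 * right after the failed call. */
enum open_status classify_open(int sock, int err)
{
	if (sock >= 0)
		return OPEN_OK;
	return err == EADDRINUSE ? OPEN_RETRY : OPEN_FATAL;
}
```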

