public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed
* Massive NFS problems on large cluster with large number of mounts
@ 2008-07-01  8:19 Carsten Aulbert
       [not found] ` <4869E8AB.4060905-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Carsten Aulbert @ 2008-07-01  8:19 UTC (permalink / raw)
  To: linux-nfs; +Cc: Henning Fehrmann, Steffen Grunewald

Hi all (now to the right email list),

We are running a large cluster and do a lot of cross-mounting between
the nodes. To get this running we are running a lot of nfsd (196) and
use mountd with 64 threads, just in case we get a massive number of hits
onto a single node. All this is on Debian Etch with a recent 2.6.24
kernel using autofs4 at the moment to do the automounts.

When running these two not nice scripts:

$ cat test_mount
#!/bin/sh

n_node=1000

for i in `seq 1 $n_node`; do
        n=`echo $RANDOM%1342+10001 | bc | sed -e "s/1/n/"`
        $HOME/bin/mount.sh $n &
        echo -n .
done

$ cat mount.sh
#!/bin/sh

dir="/distributed/spray/data/EatH/S5R1"

ping -c1 -w1 $1 > /dev/null && file="/atlas/node/$1$dir/"`ls -f
/atlas/node/$1$dir/ | head -n 50 | tail -n 1`
md5sum ${file}

With that we encounter different problems:

Running this gives this in syslog:
Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
open(/var/lib/nfs/rpc_pipefs/nfs/clntaa58/idmap): Too many open files
Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
open(/var/lib/nfs/rpc_pipefs/nfs/clntaa58/idmap): Too many open files
Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
open(/var/lib/nfs/rpc_pipefs/nfs/clntaa5e/idmap): Too many open files
Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
open(/var/lib/nfs/rpc_pipefs/nfs/clntaa5e/idmap): Too many open files
Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
open(/var/lib/nfs/rpc_pipefs/nfs/clntaa9c/idmap): Too many open files

Which is not surprising to me. However, there are a few things I'm
wondering about.

(1) All our mounts use nfsvers=3; why is rpc.idmapd involved at all?
(2) Why is this daemon growing so extremely large?
# ps aux|grep rpc.idmapd
root      2309  0.1 16.2 2037152 1326944 ?     Ss   Jun30   1:24
/usr/sbin/rpc.idmapd
NOTE: We are now disabling this one, but still it would be nice to
understand why there seems to be a memory leak.

(3) The script maxes out at about 340 concurrent mounts, any idea how to
increase this number? We are already running all servers with the
insecure option, thus low ports should not be a restriction.
(4) After running this script /etc/mtab and /proc/mounts are out of
sync. Ian Kent of autofs fame suggested a broken local mount
implementation which does not lock mtab well enough. Any idea about that?

We are currently testing autofs5, which does not give these messages,
but we are still not using high/unprivileged ports.

TIA for any help you might give us.

Cheers

Carsten

-- 
Dr. Carsten Aulbert - Max Planck Institut für Gravitationsphysik
Callinstraße 38, 30167 Hannover, Germany
Fon: +49 511 762 17185, Fax: +49 511 762 17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
       [not found] ` <4869E8AB.4060905-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
@ 2008-07-01 18:22   ` J. Bruce Fields
  2008-07-01 18:26     ` J. Bruce Fields
  2008-07-02 14:00     ` Carsten Aulbert
  0 siblings, 2 replies; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-01 18:22 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

On Tue, Jul 01, 2008 at 10:19:55AM +0200, Carsten Aulbert wrote:
> Hi all (now to the right email list),
> 
> We are running a large cluster and do a lot of cross-mounting between
> the nodes. To get this running we are running a lot of nfsd (196) and
> use mountd with 64 threads, just in case we get a massive number of hits
> onto a single node. All this is on Debian Etch with a recent 2.6.24
> kernel using autofs4 at the moment to do the automounts.

I'm slightly confused--the above is all about server configuration, but
the below seems to describe only client problems?

> 
> When running these two not nice scripts:
> 
> $ cat test_mount
> #!/bin/sh
> 
> n_node=1000
> 
> for i in `seq 1 $n_node`; do
>         n=`echo $RANDOM%1342+10001 | bc | sed -e "s/1/n/"`
>         $HOME/bin/mount.sh $n &
>         echo -n .
> done
> 
> $ cat mount.sh
> #!/bin/sh
> 
> dir="/distributed/spray/data/EatH/S5R1"
> 
> ping -c1 -w1 $1 > /dev/null && file="/atlas/node/$1$dir/"`ls -f
> /atlas/node/$1$dir/ | head -n 50 | tail -n 1`
> md5sum ${file}
> 
> With that we encounter different problems:
> 
> Running this gives this in syslog:
> Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> open(/var/lib/nfs/rpc_pipefs/nfs/clntaa58/idmap): Too many open files
> Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> open(/var/lib/nfs/rpc_pipefs/nfs/clntaa58/idmap): Too many open files
> Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> open(/var/lib/nfs/rpc_pipefs/nfs/clntaa5e/idmap): Too many open files
> Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> open(/var/lib/nfs/rpc_pipefs/nfs/clntaa5e/idmap): Too many open files
> Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> open(/var/lib/nfs/rpc_pipefs/nfs/clntaa9c/idmap): Too many open files
> 
> Which is not surprising to me. However, there are a few things I'm
> wondering about.
> 
> (1) All our mounts use nfsvers=3; why is rpc.idmapd involved at all?

Are there actually files named "idmap" in those directories?  (Looks to
me like they're only created in the v4 case, so I assume those open
calls would return ENOENT if they didn't return ENFILE....)

> (2) Why is this daemon growing so extremely large?
> # ps aux|grep rpc.idmapd
> root      2309  0.1 16.2 2037152 1326944 ?     Ss   Jun30   1:24
> /usr/sbin/rpc.idmapd

I think rpc.idmapd has some state for each directory whether they're for
a v4 client or not, since it's using dnotify to watch for an "idmap"
file to appear in each one.  The above shows about 2k per mount?

--b.

> NOTE: We are now disabling this one, but still it would be nice to
> understand why there seems to be a memory leak.
> 
> (3) The script maxes out at about 340 concurrent mounts, any idea how to
> increase this number? We are already running all servers with the
> insecure option, thus low ports should not be a restriction.
> (4) After running this script /etc/mtab and /proc/mounts are out of
> sync. Ian Kent of autofs fame suggested a broken local mount
> implementation which does not lock mtab well enough. Any idea about that?
> 
> We are currently testing autofs5, which does not give these messages,
> but we are still not using high/unprivileged ports.
> 
> TIA for any help you might give us.
> 
> Cheers
> 
> Carsten
> 
> -- 
> Dr. Carsten Aulbert - Max Planck Institut für Gravitationsphysik
> Callinstraße 38, 30167 Hannover, Germany
> Fon: +49 511 762 17185, Fax: +49 511 762 17193
> http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-01 18:22   ` J. Bruce Fields
@ 2008-07-01 18:26     ` J. Bruce Fields
  2008-07-02 14:00     ` Carsten Aulbert
  1 sibling, 0 replies; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-01 18:26 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

On Tue, Jul 01, 2008 at 02:22:50PM -0400, bfields wrote:
> On Tue, Jul 01, 2008 at 10:19:55AM +0200, Carsten Aulbert wrote:
> > Hi all (now to the right email list),
> > 
> > We are running a large cluster and do a lot of cross-mounting between
> > the nodes. To get this running we are running a lot of nfsd (196) and
> > use mountd with 64 threads, just in case we get a massive number of hits
> > onto a single node. All this is on Debian Etch with a recent 2.6.24
> > kernel using autofs4 at the moment to do the automounts.
> 
> I'm slightly confused--the above is all about server configuration, but
> the below seems to describe only client problems?
> 
> > 
> > When running these two not nice scripts:
> > 
> > $ cat test_mount
> > #!/bin/sh
> > 
> > n_node=1000
> > 
> > for i in `seq 1 $n_node`; do
> >         n=`echo $RANDOM%1342+10001 | bc | sed -e "s/1/n/"`
> >         $HOME/bin/mount.sh $n &
> >         echo -n .
> > done
> > 
> > $ cat mount.sh
> > #!/bin/sh
> > 
> > dir="/distributed/spray/data/EatH/S5R1"
> > 
> > ping -c1 -w1 $1 > /dev/null && file="/atlas/node/$1$dir/"`ls -f
> > /atlas/node/$1$dir/ | head -n 50 | tail -n 1`
> > md5sum ${file}
> > 
> > With that we encounter different problems:
> > 
> > Running this gives this in syslog:
> > Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> > open(/var/lib/nfs/rpc_pipefs/nfs/clntaa58/idmap): Too many open files
> > Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> > open(/var/lib/nfs/rpc_pipefs/nfs/clntaa58/idmap): Too many open files
> > Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> > open(/var/lib/nfs/rpc_pipefs/nfs/clntaa5e/idmap): Too many open files
> > Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> > open(/var/lib/nfs/rpc_pipefs/nfs/clntaa5e/idmap): Too many open files
> > Jul  1 07:37:19 n1312 rpc.idmapd[2309]: nfsopen:
> > open(/var/lib/nfs/rpc_pipefs/nfs/clntaa9c/idmap): Too many open files
> > 
> > Which is not surprising to me. However, there are a few things I'm
> > wondering about.
> > 
> > (1) All our mounts use nfsvers=3; why is rpc.idmapd involved at all?
> 
> Are there actually files named "idmap" in those directories?  (Looks to
> me like they're only created in the v4 case, so I assume those open
> calls would return ENOENT if they didn't return ENFILE....)
> 
> > (2) Why is this daemon growing so extremely large?
> > # ps aux|grep rpc.idmapd
> > root      2309  0.1 16.2 2037152 1326944 ?     Ss   Jun30   1:24
> > /usr/sbin/rpc.idmapd
> 
> I think rpc.idmapd has some state for each directory whether they're for
> a v4 client or not, since it's using dnotify to watch for an "idmap"
> file to appear in each one.  The above shows about 2k per mount?

Sorry, no, if ps reports those fields in kilobytes, then that's
megabytes per mount, so yes there's clearly a bug here that needs
fixing.
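For reference, the arithmetic behind that correction, as a quick sketch: procps reports VSZ in kilobytes, and the test script above drives on the order of 1000 mounts (the exact concurrent count is an assumption here).

```shell
# VSZ reported by ps for rpc.idmapd was 2037152 KB; with roughly
# 1000 mounts active, the per-mount cost in megabytes is:
awk 'BEGIN { printf "%.1f MB per mount\n", 2037152 / 1000 / 1024 }'
```

That comes out near 2 MB per mount, three orders of magnitude above the "2k" first guessed.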

--b.

> 
> --b.
> 
> > NOTE: We are now disabling this one, but still it would be nice to
> > understand why there seems to be a memory leak.
> > 
> > (3) The script maxes out at about 340 concurrent mounts, any idea how to
> > increase this number? We are already running all servers with the
> > insecure option, thus low ports should not be a restriction.
> > (4) After running this script /etc/mtab and /proc/mounts are out of
> > sync. Ian Kent of autofs fame suggested a broken local mount
> > implementation which does not lock mtab well enough. Any idea about that?
> > 
> > We are currently testing autofs5, which does not give these messages,
> > but we are still not using high/unprivileged ports.
> > 
> > TIA for any help you might give us.
> > 
> > Cheers
> > 
> > Carsten
> > 
> > -- 
> > Dr. Carsten Aulbert - Max Planck Institut für Gravitationsphysik
> > Callinstraße 38, 30167 Hannover, Germany
> > Fon: +49 511 762 17185, Fax: +49 511 762 17193
> > http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-01 18:22   ` J. Bruce Fields
  2008-07-01 18:26     ` J. Bruce Fields
@ 2008-07-02 14:00     ` Carsten Aulbert
       [not found]       ` <486B89F5.9000109-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
  1 sibling, 1 reply; 29+ messages in thread
From: Carsten Aulbert @ 2008-07-02 14:00 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

Hi all,


J. Bruce Fields wrote:
> 
> I'm slightly confused--the above is all about server configuration, but
> the below seems to describe only client problems?

Well, yes and no. All our servers are clients as well. I.e. we have
~1340 nodes which all export a local directory to be cross-mounted.

>> (1) All our mounts use nfsvers=3; why is rpc.idmapd involved at all?
> 
> Are there actually files named "idmap" in those directories?  (Looks to
> me like they're only created in the v4 case, so I assume those open
> calls would return ENOENT if they didn't return ENFILE....)

No, there are not, and since we are not running v4 yet, we've disabled
starting it on all nodes now.


> 
>> (2) Why is this daemon growing so extremely large?
>> # ps aux|grep rpc.idmapd
>> root      2309  0.1 16.2 2037152 1326944 ?     Ss   Jun30   1:24
>> /usr/sbin/rpc.idmapd
> 
> I think rpc.idmapd has some state for each directory whether they're for
> a v4 client or not, since it's using dnotify to watch for an "idmap"
> file to appear in each one.  The above shows about 2k per mount?

As you have written in your other email, yes, that's 2 GByte, and I've
seen boxes with > 500 hung mounts where the process was using all of the
8 GByte. So I do think there is a bug.

OTOH, we still have the problem that we can only mount up to ~350
remote directories. We think we have tracked this down to the fact that
the NFS clients refuse to use ports > 1023, even though the servers are
exporting with the "insecure" option. Is there a way to force this?
Right now the NFS clients use ports 665-1023 (except a few odd ports
which were in use earlier).

Any hint on how we should proceed and maybe force the clients to also
use ports > 1023? I think that would solve our problems.

Cheers

Carsten


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]       ` <486B89F5.9000109-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
@ 2008-07-02 20:31         ` J. Bruce Fields
  2008-07-02 21:04           ` Trond Myklebust
  0 siblings, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-02 20:31 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

On Wed, Jul 02, 2008 at 04:00:21PM +0200, Carsten Aulbert wrote:
> Hi all,
> 
> 
> J. Bruce Fields wrote:
> > 
> > I'm slightly confused--the above is all about server configuration, but
> > the below seems to describe only client problems?
> 
> Well, yes and no. All our servers are clients as well. I.e. we have
> ~1340 nodes which all export a local directory to be cross-mounted.
> 
> >> (1) All our mounts use nfsvers=3; why is rpc.idmapd involved at all?
> > 
> > Are there actually files named "idmap" in those directories?  (Looks to
> > me like they're only created in the v4 case, so I assume those open
> > calls would return ENOENT if they didn't return ENFILE....)
> 
> No, there are not, and since we are not running v4 yet, we've disabled
> starting it on all nodes now.
> 
> 
> > 
> >> (2) Why is this daemon growing so extremely large?
> >> # ps aux|grep rpc.idmapd
> >> root      2309  0.1 16.2 2037152 1326944 ?     Ss   Jun30   1:24
> >> /usr/sbin/rpc.idmapd
> > 
> > I think rpc.idmapd has some state for each directory whether they're for
> > a v4 client or not, since it's using dnotify to watch for an "idmap"
> > file to appear in each one.  The above shows about 2k per mount?
> 
> As you have written in your other email, yes, that's 2 GByte, and I've
> seen boxes with > 500 hung mounts where the process was using all of the
> 8 GByte. So I do think there is a bug.
> 
> OTOH, we still have the problem that we can only mount up to ~350
> remote directories. We think we have tracked this down to the fact that
> the NFS clients refuse to use ports > 1023, even though the servers are
> exporting with the "insecure" option. Is there a way to force this?
> Right now the NFS clients use ports 665-1023 (except a few odd ports
> which were in use earlier).
> 
> Any hint on how we should proceed and maybe force the clients to also
> use ports > 1023? I think that would solve our problems.

I think the below (untested) would tell the client to stop demanding a
privileged port.

Then you may find you run into other problems, I don't know.  Sounds
like nobody's using this many mounts, so you get to find out what the
next limit is....  But if it works, then maybe someday we should add a
mount option to control this.

--b.


diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 8945307..51f68cc 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -300,9 +300,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
 	 * but it is always enabled for rpciod, which handles the connect
 	 * operation.
 	 */
-	xprt->resvport = 1;
-	if (args->flags & RPC_CLNT_CREATE_NONPRIVPORT)
-		xprt->resvport = 0;
+	xprt->resvport = 0;
 
 	clnt = rpc_new_client(args, xprt);
 	if (IS_ERR(clnt))

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-02 20:31         ` J. Bruce Fields
@ 2008-07-02 21:04           ` Trond Myklebust
  2008-07-02 21:08             ` J. Bruce Fields
                               ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Trond Myklebust @ 2008-07-02 21:04 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Wed, 2008-07-02 at 16:31 -0400, J. Bruce Fields wrote:
> On Wed, Jul 02, 2008 at 04:00:21PM +0200, Carsten Aulbert wrote:
> > Hi all,
> > 
> > 
> > J. Bruce Fields wrote:
> > > 
> > > I'm slightly confused--the above is all about server configuration, but
> > > the below seems to describe only client problems?
> > 
> > Well, yes and no. All our servers are clients as well. I.e. we have
> > ~1340 nodes which all export a local directory to be cross-mounted.
> > 
> > >> (1) All our mounts use nfsvers=3; why is rpc.idmapd involved at all?
> > > 
> > > Are there actually files named "idmap" in those directories?  (Looks to
> > > me like they're only created in the v4 case, so I assume those open
> > > calls would return ENOENT if they didn't return ENFILE....)
> > 
> > No, there are not, and since we are not running v4 yet, we've disabled
> > starting it on all nodes now.
> > 
> > 
> > > 
> > >> (2) Why is this daemon growing so extremely large?
> > >> # ps aux|grep rpc.idmapd
> > >> root      2309  0.1 16.2 2037152 1326944 ?     Ss   Jun30   1:24
> > >> /usr/sbin/rpc.idmapd
> > > 
> > > I think rpc.idmapd has some state for each directory whether they're for
> > > a v4 client or not, since it's using dnotify to watch for an "idmap"
> > > file to appear in each one.  The above shows about 2k per mount?
> > 
> > As you have written in your other email, yes, that's 2 GByte, and I've
> > seen boxes with > 500 hung mounts where the process was using all of the
> > 8 GByte. So I do think there is a bug.
> > 
> > OTOH, we still have the problem that we can only mount up to ~350
> > remote directories. We think we have tracked this down to the fact that
> > the NFS clients refuse to use ports > 1023, even though the servers are
> > exporting with the "insecure" option. Is there a way to force this?
> > Right now the NFS clients use ports 665-1023 (except a few odd ports
> > which were in use earlier).
> > 
> > Any hint on how we should proceed and maybe force the clients to also
> > use ports > 1023? I think that would solve our problems.
> 
> I think the below (untested) would tell the client to stop demanding a
> privileged port.

Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
and /proc/sys/sunrpc/max_resvport to whatever range of ports you
actually want to use.
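As a minimal sketch of that suggestion (the exact bounds below are only an example; pick a range that suits your environment, and note the servers must export with "insecure" for non-reserved source ports to be accepted):

```shell
# Widen the RPC client's source-port range beyond the ~360 reserved
# ports it was confined to, so more concurrent mounts are possible.
echo 1024  > /proc/sys/sunrpc/min_resvport
echo 65535 > /proc/sys/sunrpc/max_resvport
```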

Trond


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-02 21:04           ` Trond Myklebust
@ 2008-07-02 21:08             ` J. Bruce Fields
  2008-07-03  5:31             ` Carsten Aulbert
  2008-07-16  9:49             ` Carsten Aulbert
  2 siblings, 0 replies; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-02 21:08 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Wed, Jul 02, 2008 at 05:04:36PM -0400, Trond Myklebust wrote:
> On Wed, 2008-07-02 at 16:31 -0400, J. Bruce Fields wrote:
> > On Wed, Jul 02, 2008 at 04:00:21PM +0200, Carsten Aulbert wrote:
> > > Any hint on how we should proceed and maybe force the clients to also
> > > use ports > 1023? I think that would solve our problems.
> > 
> > I think the below (untested) would tell the client to stop demanding a
> > privileged port.
> 
> Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
> and /proc/sys/sunrpc/max_resvport to whatever range of ports you
> actually want to use.

Whoops, yes, I missed those, thanks.

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-02 21:04           ` Trond Myklebust
  2008-07-02 21:08             ` J. Bruce Fields
@ 2008-07-03  5:31             ` Carsten Aulbert
       [not found]               ` <486C642B.3020100-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
  2008-07-16  9:49             ` Carsten Aulbert
  2 siblings, 1 reply; 29+ messages in thread
From: Carsten Aulbert @ 2008-07-03  5:31 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: J. Bruce Fields, linux-nfs, Henning Fehrmann, Steffen Grunewald



Trond Myklebust wrote:

> 
> Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
> and /proc/sys/sunrpc/max_resvport to whatever range of ports you
> actually want to use.

That indeed looks great. We will hopefully test this today and see where
the next ceiling is that we will bang our heads into ;)

Thanks

Carsten

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]               ` <486C642B.3020100-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
@ 2008-07-03 12:35                 ` Carsten Aulbert
  0 siblings, 0 replies; 29+ messages in thread
From: Carsten Aulbert @ 2008-07-03 12:35 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: J. Bruce Fields, linux-nfs, Henning Fehrmann, Steffen Grunewald



Carsten Aulbert wrote:
> 
> Trond Myklebust wrote:
> 
>> Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
>> and /proc/sys/sunrpc/max_resvport to whatever range of ports you
>> actually want to use.
> 
> That indeed looks great. We will hopefully test this today and see where
> the next ceiling is that we will bang our heads into ;)

OK, we reached a mount count a little beyond 1300 today without any
negative side-effect found so far.

Thanks a lot for your answers!

Cheers

Carsten

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-02 21:04           ` Trond Myklebust
  2008-07-02 21:08             ` J. Bruce Fields
  2008-07-03  5:31             ` Carsten Aulbert
@ 2008-07-16  9:49             ` Carsten Aulbert
       [not found]               ` <487DC43F.8040408-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
  2 siblings, 1 reply; 29+ messages in thread
From: Carsten Aulbert @ 2008-07-16  9:49 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

Hi Trond et al.

I'm following up on this discussion because we hit another problem:

Trond Myklebust wrote:

> 
> Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
> and /proc/sys/sunrpc/max_resvport to whatever range of ports you
> actually want to use.

This works like a charm; however, if you set these values before
restarting the nfs-kernel-server you are in deep trouble, since
when nfsd starts it needs to register with the portmapper, right?

But what happens if this request comes from a high^Wunprivileged port?
Right:
Jul 16 11:46:43 d23 portmap[8216]: connect from 127.0.0.1 to set(nfs):
request from unprivileged port
Jul 16 11:46:43 d23 nfsd[8214]: nfssvc: writting fds to kernel failed:
errno 13 (Permission denied)
Jul 16 11:46:44 d23 kernel: [ 8437.726223] NFSD: Using
/var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Jul 16 11:46:44 d23 kernel: [ 8437.800607] NFSD: starting 90-second
grace period
Jul 16 11:46:44 d23 kernel: [ 8437.842891] nfsd: last server has exited
Jul 16 11:46:44 d23 kernel: [ 8437.879940] nfsd: unexporting all filesystems
Jul 16 11:46:44 d23 nfsd[8214]: nfssvc: Address already in use


Changing /proc/sys/sunrpc/max_resvport back to 1023 resolves this
issue, but defeats the purpose of the initial fix. I still need
to look into the code for the portmapper, but would it be easy to make
the portmapper accept nfsd requests from "insecure" ports as well?
Since we are (mostly) in a controlled environment that should not pose a
problem.
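One untested workaround for the ordering problem described above: restore the privileged range just long enough for nfsd to register with the portmapper, then widen it again (the init-script path and port bound are assumptions for Debian Etch):

```shell
# nfsd must contact portmap from a port < 1024, so narrow the range,
# restart the server, then re-open the range for the client mounts.
echo 1023  > /proc/sys/sunrpc/max_resvport
/etc/init.d/nfs-kernel-server restart
echo 65535 > /proc/sys/sunrpc/max_resvport
```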

Anyone with an idea?

Thanks a lot

Carsten

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]               ` <487DC43F.8040408-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
@ 2008-07-16 19:06                 ` J. Bruce Fields
  2008-07-17  5:53                   ` Carsten Aulbert
  2008-07-17 14:47                   ` Chuck Lever
  0 siblings, 2 replies; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-16 19:06 UTC (permalink / raw)
  To: Carsten Aulbert
  Cc: Trond Myklebust, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Wed, Jul 16, 2008 at 11:49:51AM +0200, Carsten Aulbert wrote:
> Hi Trond et al.
> 
> I'm following up on this discussion because we hit another problem:
> 
> Trond Myklebust wrote:
> 
> > 
> > Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
> > and /proc/sys/sunrpc/max_resvport to whatever range of ports you
> > actually want to use.
> 
> This works like a charm; however, if you set these values before
> restarting the nfs-kernel-server you are in deep trouble, since
> when nfsd starts it needs to register with the portmapper, right?
> 
> But what happens if this request comes from a high^Wunprivileged port?
> Right:
> Jul 16 11:46:43 d23 portmap[8216]: connect from 127.0.0.1 to set(nfs):
> request from unprivileged port
> Jul 16 11:46:43 d23 nfsd[8214]: nfssvc: writting fds to kernel failed:
> errno 13 (Permission denied)
> Jul 16 11:46:44 d23 kernel: [ 8437.726223] NFSD: Using
> /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> Jul 16 11:46:44 d23 kernel: [ 8437.800607] NFSD: starting 90-second
> grace period
> Jul 16 11:46:44 d23 kernel: [ 8437.842891] nfsd: last server has exited
> Jul 16 11:46:44 d23 kernel: [ 8437.879940] nfsd: unexporting all filesystems
> Jul 16 11:46:44 d23 nfsd[8214]: nfssvc: Address already in use
> 
> 
> Changing /proc/sys/sunrpc/max_resvport back to 1023 resolves this
> issue, but defeats the purpose of the initial fix. I still need
> to look into the code for the portmapper, but would it be easy to make
> the portmapper accept nfsd requests from "insecure" ports as well?
> Since we are (mostly) in a controlled environment that should not pose a
> problem.
> 
> Anyone with an idea?

The immediate problem seems like a kernel bug to me--it seems to me that
the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
is there some way the daemons can still know that those calls come from
the local kernel?)

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-16 19:06                 ` J. Bruce Fields
@ 2008-07-17  5:53                   ` Carsten Aulbert
       [not found]                     ` <487EDE57.4070100-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
  2008-07-17 14:47                   ` Chuck Lever
  1 sibling, 1 reply; 29+ messages in thread
From: Carsten Aulbert @ 2008-07-17  5:53 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

Hi all,

J. Bruce Fields wrote:

>> Changing /proc/sys/sunrpc/max_resvport back to 1023 resolves this
>> issue, but defeats the purpose of the initial fix. I still need
>> to look into the code for the portmapper, but would it be easy to make
>> the portmapper accept nfsd requests from "insecure" ports as well?
>> Since we are (mostly) in a controlled environment that should not pose a
>> problem.
>>
>> Anyone with an idea?
> 
> The immediate problem seems like a kernel bug to me--it seems to me that
> the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
> is there some way the daemons can still know that those calls come from
> the local kernel?)

I just found this in the Makefile for the portmapper:

# To disable tcp-wrapper style access control, comment out the following
# macro definitions.  Access control can also be turned off by providing
# no access control tables. The local system, since it runs the portmap
# daemon, is always treated as an authorized host.

HOSTS_ACCESS= -DHOSTS_ACCESS
#WRAP_LIB = $(WRAP_DIR)/libwrap.a
WRAP_LIB = -lwrap

# Comment out if your RPC library does not allocate privileged ports for
# requests from processes with root privilege, or the new portmap will
# always reject requests to register/unregister services on privileged
# ports. You can find out by running "rpcinfo -p"; if all mountd and NIS
# daemons use a port >= 1024 you should probably disable the next line.

CHECK_PORT = -DCHECK_PORT

I'll try to head down the road of not checking for the ports anymore -
on exposed ports I could block the listening daemons from the outside
world with iptables. Not nice, but probably a solution (and yet another
custom package for us).
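The iptables side could look like the following sketch, restricting portmap (which listens on port 111) to the cluster's own network; the 10.0.0.0/16 subnet is only a placeholder:

```shell
# Allow portmap (port 111, TCP and UDP) from the cluster subnet only,
# and drop it from everywhere else.
iptables -A INPUT -p tcp --dport 111 -s 10.0.0.0/16 -j ACCEPT
iptables -A INPUT -p udp --dport 111 -s 10.0.0.0/16 -j ACCEPT
iptables -A INPUT -p tcp --dport 111 -j DROP
iptables -A INPUT -p udp --dport 111 -j DROP
```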

Anyone who knows a good reason not to walk this route?

Cheers

Carsten

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                     ` <487EDE57.4070100-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
@ 2008-07-17 14:27                       ` J. Bruce Fields
  0 siblings, 0 replies; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-17 14:27 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: linux-nfs, Henning Fehrmann, Steffen Grunewald

On Thu, Jul 17, 2008 at 07:53:27AM +0200, Carsten Aulbert wrote:
> Hi all,
> 
> J. Bruce Fields wrote:
> 
> >> Changing /proc/sys/sunrpc/max_resvport back to 1023 resolves this
> >> issue, but defeats the purpose of the initial fix. I still need
> >> to look into the code for the portmapper, but would it be easy to make
> >> the portmapper accept nfsd requests from "insecure" ports as well?
> >> Since we are (mostly) in a controlled environment that should not pose a
> >> problem.
> >>
> >> Anyone with an idea?
> > 
> > The immediate problem seems like a kernel bug to me--it seems to me that
> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
> > is there some way the daemons can still know that those calls come from
> > the local kernel?)
> 
> I just found this in the Makefile for the portmapper:
> 
> # To disable tcp-wrapper style access control, comment out the following
> # macro definitions.  Access control can also be turned off by providing
> # no access control tables. The local system, since it runs the portmap
> # daemon, is always treated as an authorized host.
> 
> HOSTS_ACCESS= -DHOSTS_ACCESS
> #WRAP_LIB = $(WRAP_DIR)/libwrap.a
> WRAP_LIB = -lwrap
> 

Slightly off-topic, but I'm confused by the comment:

> # Comment out if your RPC library does not allocate privileged ports for
> # requests from processes with root privilege, or the new portmap will
> # always reject requests to register/unregister services on privileged
> # ports.

Shouldn't that be "on unprivileged ports"?

> You can find out by running "rpcinfo -p"; if all mountd and NIS
> # daemons use a port >= 1024 you should probably disable the next line.

Doesn't rpcinfo -p just tell you which port those daemons are listening
on, not which ports they'll use for contacting the portmapper?  A priori
I don't see what one would have to do with the other.
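
For context, typical "rpcinfo -p" output looks like the transcript below (the program numbers are standard RPC assignments; the port values vary per host). It lists only the ports the services listen on:

```shell
$ rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100005    3   udp    892  mountd
    100005    3   tcp    892  mountd
    100003    3   tcp   2049  nfs
```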

> 
> CHECK_PORT = -DCHECK_PORT
> 
> I'll try to head down the road of not checking for the ports anymore -
> on exposed ports I could block the listening daemons from the outside
> world by iptables.

It's just the port that the portmapper itself listens on that needs to
be firewalled, right?

> Not nice, but probably a solution (and yet another
> custom package for us).
> 
> Anyone who knows a good reason not to walk this route?

I guess the risk is that any old userland process on the server can now
advertise nfsd service, and the clients end up contacting it instead of
the kernel's nfsd.

--b.


* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-16 19:06                 ` J. Bruce Fields
  2008-07-17  5:53                   ` Carsten Aulbert
@ 2008-07-17 14:47                   ` Chuck Lever
       [not found]                     ` <76bd70e30807170747r31af3280icf0bd3fdbde17bac-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-07-17 14:47 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Carsten Aulbert, Trond Myklebust, linux-nfs, Henning Fehrmann,
	Steffen Grunewald

On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Wed, Jul 16, 2008 at 11:49:51AM +0200, Carsten Aulbert wrote:
>> Hi Trond et al.
>>
>> I'm following up on this discussion because we hit another problem:
>>
>> Trond Myklebust wrote:
>>
>> >
>> > Alternatively, just change the values of /proc/sys/sunrpc/min_resvport
>> > and /proc/sys/sunrpc/max_resvport to whatever range of ports you
>> > actually want to use.
>>
>> This works like a charm, however, if you set these values before
>> restarting the nfs-kernel-server then you are in deep trouble, since
>> when nfsd wants to start it needs to register with the portmapper, right?
>>
>> But what happens if this request comes from a high^Wunprivileged port?
>> Right:
>> Jul 16 11:46:43 d23 portmap[8216]: connect from 127.0.0.1 to set(nfs):
>> request from unprivileged port
>> Jul 16 11:46:43 d23 nfsd[8214]: nfssvc: writting fds to kernel failed:
>> errno 13 (Permission denied)
>> Jul 16 11:46:44 d23 kernel: [ 8437.726223] NFSD: Using
>> /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>> Jul 16 11:46:44 d23 kernel: [ 8437.800607] NFSD: starting 90-second
>> grace period
>> Jul 16 11:46:44 d23 kernel: [ 8437.842891] nfsd: last server has exited
>> Jul 16 11:46:44 d23 kernel: [ 8437.879940] nfsd: unexporting all filesystems
>> Jul 16 11:46:44 d23 nfsd[8214]: nfssvc: Address already in use
>>
>>
>> Changing /proc/sys/sunrpc/max_resvport to 1023 again resolves this
>> issue, however defeats the purpose for the initial problem. I still need
>> to look into the code for the portmapper, but is it easily possible that
>> the portmapper would accept nfsd requests from "insecure" ports also?
>> Since we are (mostly) in a controlled environment that should not pose a
>> problem.
>>
>> Anyone with an idea?
>
> The immediate problem seems like a kernel bug to me--it seems to me that
> the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
> is there some way the daemons can still know that those calls come from
> the local kernel?)

I tend to agree.  The rpcbind client (at least) does specifically
require a privileged port, so a large min/max port range would be out
of the question for those rpc_clients.

-- 
Chuck Lever


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                     ` <76bd70e30807170747r31af3280icf0bd3fdbde17bac-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-07-17 14:48                       ` J. Bruce Fields
  2008-07-17 15:11                         ` Chuck Lever
  2008-07-17 15:35                       ` Trond Myklebust
  1 sibling, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-17 14:48 UTC (permalink / raw)
  To: chucklever
  Cc: Carsten Aulbert, Trond Myklebust, linux-nfs, Henning Fehrmann,
	Steffen Grunewald

On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > The immediate problem seems like a kernel bug to me--it seems to me that
> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
> > is there some way the daemons can still know that those calls come from
> > the local kernel?)
> 
> I tend to agree.  The rpcbind client (at least) does specifically
> require a privileged port, so a large min/max port range would be out
> of the question for those rpc_clients.

Any chance I could talk you into doing a patch for that?

--b.


* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-17 14:48                       ` J. Bruce Fields
@ 2008-07-17 15:11                         ` Chuck Lever
       [not found]                           ` <76bd70e30807170811s78175c0ep3a52da7c0ef95fc6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-07-17 15:11 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Carsten Aulbert, Trond Myklebust, linux-nfs, Henning Fehrmann,
	Steffen Grunewald

On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> > The immediate problem seems like a kernel bug to me--it seems to me that
>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
>> > is there some way the daemons can still know that those calls come from
>> > the local kernel?)
>>
>> I tend to agree.  The rpcbind client (at least) does specifically
>> require a privileged port, so a large min/max port range would be out
>> of the question for those rpc_clients.
>
> Any chance I could talk you into doing a patch for that?

I can look at it when I get back next week.

-- 
 "Alright guard, begin the unnecessarily slow-moving dipping mechanism."
--Dr. Evil


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                     ` <76bd70e30807170747r31af3280icf0bd3fdbde17bac-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-07-17 14:48                       ` J. Bruce Fields
@ 2008-07-17 15:35                       ` Trond Myklebust
  1 sibling, 0 replies; 29+ messages in thread
From: Trond Myklebust @ 2008-07-17 15:35 UTC (permalink / raw)
  To: chucklever
  Cc: J. Bruce Fields, Carsten Aulbert, linux-nfs, Henning Fehrmann,
	Steffen Grunewald

On Thu, 2008-07-17 at 10:47 -0400, Chuck Lever wrote:
> I tend to agree.  The rpcbind client (at least) does specifically
> require a privileged port, so a large min/max port range would be out
> of the question for those rpc_clients.

'portmap' does indeed appear to check for privileged ports, but I can't
see any such checks in the rpcbind code.




* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                           ` <76bd70e30807170811s78175c0ep3a52da7c0ef95fc6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-07-28 20:55                             ` Chuck Lever
       [not found]                               ` <76bd70e30807281355t4890a9b2q6960d79552538f60-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-07-28 20:55 UTC (permalink / raw)
  To: J. Bruce Fields, Trond Myklebust, Trond Myklebust
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Thu, Jul 17, 2008 at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
>>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> > The immediate problem seems like a kernel bug to me--it seems to me that
>>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
>>> > is there some way the daemons can still know that those calls come from
>>> > the local kernel?)
>>>
>>> I tend to agree.  The rpcbind client (at least) does specifically
>>> require a privileged port, so a large min/max port range would be out
>>> of the question for those rpc_clients.
>>
>> Any chance I could talk you into doing a patch for that?
>
> I can look at it when I get back next week.

I've been pondering this.

It seems like the NFS client is a rather unique case for using
unprivileged ports; most or all of the other RPC clients in the kernel
want to use privileged ports pretty much all the time, and have
learned to switch this off as needed and appropriate.  We even have an
internal API feature for doing this: the RPC_CLNT_CREATE_NONPRIVPORT
flag to rpc_create().

And instead of allowing a wide source port range, it would be better
for the NFS client to use either privileged ports, or unprivileged
ports, but not both, for the same mount point.  Otherwise we could be
opening ourselves up for non-deterministic behavior: "How come
sometimes I get EPERM when I try to mount my NFS servers, but other
times the same mount command works fine?" or "Sometimes after a long
idle period my NFS mount points stop working, and all the programs
running on the mount point get EACCES."

It seems like a good solution would be to:

1.  Make the xprt_minresvport and xprt_maxresvport sysctls mean what
they say: they are _reserved_ port limits.  Thus xprt_maxresvport
should never be allowed to be larger than 1023, and xprt_minresvport
should always be made to be strictly less than xprt_maxresvport; and

2.  Introduce a mechanism to specifically enable the NFS client to use
non-privileged ports.  It could be a new mount option like "insecure"
(which is what some other O/Ses use) or "unpriv-source-port" for
example.  I tend to dislike the former because such a feature is
likely to be quite useful with Kerberos-authenticated NFS, and
"sec=krb5,insecure" is probably a little funny looking, but
"sec=krb5,unpriv-source-port" makes it pretty clear what is going on.

Such an "insecure" mount option would then set
RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the NFS
client.

I'm not married to the names of the options, or even using a mount
option at all (although that seems like a natural place to put such a
feature).
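
The range rules in point 1 above could be sketched like this (illustrative shell only, not the actual kernel sysctl handler):

```shell
# Clamp a (min, max) reserved-port range the way point 1 proposes:
# the maximum can never exceed 1023, and the minimum must stay
# strictly below the maximum.
clamp_resvport_range() {
    min=$1
    max=$2
    # A reserved port is always <= 1023.
    if [ "$max" -gt 1023 ]; then max=1023; fi
    # Keep the minimum strictly below the maximum.
    if [ "$min" -ge "$max" ]; then min=$((max - 1)); fi
    echo "$min $max"
}

clamp_resvport_range 665 2000   # a too-large maximum is clamped: "665 1023"
```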

Thoughts?

-- 
Chuck Lever


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                               ` <76bd70e30807281355t4890a9b2q6960d79552538f60-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-07-29 11:32                                 ` Jeff Layton
       [not found]                                   ` <20080729073203.546a4269-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2008-07-30 17:53                                 ` J. Bruce Fields
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2008-07-29 11:32 UTC (permalink / raw)
  To: chucklever
  Cc: chuck.lever, J. Bruce Fields, Trond Myklebust, Trond Myklebust,
	Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Mon, 28 Jul 2008 16:55:50 -0400
"Chuck Lever" <chuck.lever@oracle.com> wrote:

> On Thu, Jul 17, 2008 at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
> >>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>> > The immediate problem seems like a kernel bug to me--it seems to me that
> >>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
> >>> > is there some way the daemons can still know that those calls come from
> >>> > the local kernel?)
> >>>
> >>> I tend to agree.  The rpcbind client (at least) does specifically
> >>> require a privileged port, so a large min/max port range would be out
> >>> of the question for those rpc_clients.
> >>
> >> Any chance I could talk you into doing a patch for that?
> >
> > I can look at it when I get back next week.
> 
> I've been pondering this.
> 
> It seems like the NFS client is a rather unique case for using
> unprivileged ports; most or all of the other RPC clients in the kernel
> want to use privileged ports pretty much all the time, and have
> learned to switch this off as needed and appropriate.  We even have an
> internal API feature for doing this: the RPC_CLNT_CREATE_NONPRIVPORT
> flag to rpc_create().
> 
> And instead of allowing a wide source port range, it would be better
> for the NFS client to use either privileged ports, or unprivileged
> ports, but not both, for the same mount point.  Otherwise we could be
> opening ourselves up for non-deterministic behavior: "How come
> sometimes I get EPERM when I try to mount my NFS servers, but other
> times the same mount command works fine?" or "Sometimes after a long
> idle period my NFS mount points stop working, and all the programs
> running on the mount point get EACCES."
> 
> It seems like a good solution would be to:
> 
> 1.  Make the xprt_minresvport and xprt_maxresvport sysctls mean what
> they say: they are _reserved_ port limits.  Thus xprt_maxresvport
> should never be allowed to be larger than 1023, and xprt_minresvport
> should always be made to be strictly less than xprt_maxresvport; and
> 
> 2.  Introduce a mechanism to specifically enable the NFS client to use
> non-privileged ports.  It could be a new mount option like "insecure"
> (which is what some other O/Ses use) or "unpriv-source-port" for
> example.  I tend to dislike the former because such a feature is
> likely to be quite useful with Kerberos-authenticated NFS, and
> "sec=krb5,insecure" is probably a little funny looking, but
> "sec=krb5,unpriv-source-port" makes it pretty clear what is going on.
> 
> Such an "insecure" mount option would then set
> RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the NFS
> client.
> 
> I'm not married to the names of the options, or even using a mount
> option at all (although that seems like a natural place to put such a
> feature).
> 
> Thoughts?
>

IMNSHO, the whole concept of "privileged ports" is pretty antiquated
anyway. It doesn't mean much unless you have a very tightly controlled
physical network...

Being able to allow the client to use non-privileged ports could be
useful. It's less of a problem than it used to be since the NFS client
shares sockets better now, but it could still be a problem in an HPC-type
environment. The NFS server already has an option to allow for clients
that do this so we might as well allow the client to do it too.
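
The server-side option referred to above is the "insecure" flag in /etc/exports, which tells the Linux server to accept requests from source ports >= 1024 (the path and client pattern below are examples only):

```shell
# /etc/exports -- "insecure" permits clients using unprivileged source ports
/export/data  10.0.0.0/16(rw,insecure,no_subtree_check)
```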

I tend to be of the opinion that we should try to use option names that
other OS's have already established where possible. This makes it easier
for admins in mixed environments (shared autofs maps and fewer option
synonyms to remember). My vote would be for calling the new option
"insecure", or at least making "insecure" a synonym for whatever the
new mount option is.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                                   ` <20080729073203.546a4269-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-07-29 17:43                                     ` Mike Mackovitch
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Mackovitch @ 2008-07-29 17:43 UTC (permalink / raw)
  To: Jeff Layton
  Cc: chucklever, chuck.lever, J. Bruce Fields, Trond Myklebust,
	Trond Myklebust, Carsten Aulbert, linux-nfs, Henning Fehrmann,
	Steffen Grunewald

On Tue, Jul 29, 2008 at 07:32:03AM -0400, Jeff Layton wrote:
> 
> IMNSHO, the whole concept of "privileged ports" is pretty antiquated
> anyway. It doesn't mean much unless you have a very tightly controlled
> physical network...
> 
> Being able to allow the client to use non-privileged ports could be
> useful. It's less of a problem than it used to be since the NFS client
> shares sockets better now, but it could still be a problem in an HPC-type
> environment. The NFS server already has an option to allow for clients
> that do this so we might as well allow the client to do it too.
> 
> I tend to be of the opinion that we should try to use option names that
> other OS's have already established where possible. This makes it easier
> for admins in mixed environments (shared autofs maps and fewer option
> synonyms to remember). My vote would be for calling the new option
> "insecure", or at least making "insecure" a synonym for whatever the
> new mount option is.

BSD has had such an option for years: "resvport"
You can make the default be enabled and if you don't
want it just specify "noresvport".

It has the added bonus that it doesn't falsely imply anything
about security.  (If you want security, use Kerberos.)
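
From the command line, the BSD-style pair described above would look roughly like this (server name and paths are placeholders):

```shell
# Require a reserved source port (the usual default):
mount -o resvport server:/export /mnt
# Explicitly allow an unprivileged source port:
mount -o noresvport server:/export /mnt
```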

HTH
--macko


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                               ` <76bd70e30807281355t4890a9b2q6960d79552538f60-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-07-29 11:32                                 ` Jeff Layton
@ 2008-07-30 17:53                                 ` J. Bruce Fields
  2008-07-30 19:33                                   ` Chuck Lever
  1 sibling, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-30 17:53 UTC (permalink / raw)
  To: chucklever
  Cc: Trond Myklebust, Trond Myklebust, Carsten Aulbert, linux-nfs,
	Henning Fehrmann, Steffen Grunewald

On Mon, Jul 28, 2008 at 04:55:50PM -0400, Chuck Lever wrote:
> On Thu, Jul 17, 2008 at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
> >>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>> > The immediate problem seems like a kernel bug to me--it seems to me that
> >>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
> >>> > is there some way the daemons can still know that those calls come from
> >>> > the local kernel?)
> >>>
> >>> I tend to agree.  The rpcbind client (at least) does specifically
> >>> require a privileged port, so a large min/max port range would be out
> >>> of the question for those rpc_clients.
> >>
> >> Any chance I could talk you into doing a patch for that?
> >
> > I can look at it when I get back next week.
> 
> I've been pondering this.
> 
> It seems like the NFS client is a rather unique case for using
> unprivileged ports; most or all of the other RPC clients in the kernel
> want to use privileged ports pretty much all the time, and have
> learned to switch this off as needed and appropriate.  We even have an
> internal API feature for doing this: the RPC_CLNT_CREATE_NONPRIVPORT
> flag to rpc_create().
> 
> And instead of allowing a wide source port range, it would be better
> for the NFS client to use either privileged ports, or unprivileged
> ports, but not both, for the same mount point.  Otherwise we could be
> opening ourselves up for non-deterministic behavior: "How come
> sometimes I get EPERM when I try to mount my NFS servers, but other
> times the same mount command works fine?" or "Sometimes after a long
> idle period my NFS mount points stop working, and all the programs
> running on the mount point get EACCES."
> 
> It seems like a good solution would be to:
> 
> 1.  Make the xprt_minresvport and xprt_maxresvport sysctls mean what
> they say: they are _reserved_ port limits.  Thus xprt_maxresvport
> should never be allowed to be larger than 1023, and xprt_minresvport
> should always be made to be strictly less than xprt_maxresvport; and

That would break existing setups: so, someone googles for "nfs linux
large numbers of mounts" and comes across:

	http://marc.info/?l=linux-nfs&m=121509091004851&w=2

They add

	echo 2000 >/proc/sys/sunrpc/max_resvport

to their initscripts, and their problem goes away.  A year later, with
this incident long forgotten, they upgrade their kernel, start getting
failed mounts, and in the worst case end up debugging the whole problem
from scratch again.

> 2.  Introduce a mechanism to specifically enable the NFS client to use
> non-privileged ports.  It could be a new mount option like "insecure"
> (which is what some other O/Ses use) or "unpriv-source-port" for
> example.  I tend to dislike the former because such a feature is
> likely to be quite useful with Kerberos-authenticated NFS, and
> "sec=krb5,insecure" is probably a little funny looking, but
> "sec=krb5,unpriv-source-port" makes it pretty clear what is going on.

But I can see the argument for the mount option.

Maybe we could leave the meaning of the sysctls alone, and allow
noresvport as an alternate way to allow use of nonreserved ports?

In any case, this all seems a bit orthogonal to the problem of what
ports the rpcbind client uses, right?

--b.

> 
> Such an "insecure" mount option would then set
> RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the NFS
> client.
> 
> I'm not married to the names of the options, or even using a mount
> option at all (although that seems like a natural place to put such a
> feature).
> 
> Thoughts?
> 
> -- 
> Chuck Lever


* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-30 17:53                                 ` J. Bruce Fields
@ 2008-07-30 19:33                                   ` Chuck Lever
       [not found]                                     ` <76bd70e30807301233t73f92775tbdeb3f8efbb34a4f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-07-30 19:33 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, Trond Myklebust, Carsten Aulbert, linux-nfs,
	Henning Fehrmann, Steffen Grunewald

On Wed, Jul 30, 2008 at 1:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Mon, Jul 28, 2008 at 04:55:50PM -0400, Chuck Lever wrote:
>> On Thu, Jul 17, 2008 at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> > On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> >> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
>> >>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> >>> > The immediate problem seems like a kernel bug to me--it seems to me that
>> >>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
>> >>> > is there some way the daemons can still know that those calls come from
>> >>> > the local kernel?)
>> >>>
>> >>> I tend to agree.  The rpcbind client (at least) does specifically
>> >>> require a privileged port, so a large min/max port range would be out
>> >>> of the question for those rpc_clients.
>> >>
>> >> Any chance I could talk you into doing a patch for that?
>> >
>> > I can look at it when I get back next week.
>>
>> I've been pondering this.
>>
>> It seems like the NFS client is a rather unique case for using
>> unprivileged ports; most or all of the other RPC clients in the kernel
>> want to use privileged ports pretty much all the time, and have
>> learned to switch this off as needed and appropriate.  We even have an
>> internal API feature for doing this: the RPC_CLNT_CREATE_NONPRIVPORT
>> flag to rpc_create().
>>
>> And instead of allowing a wide source port range, it would be better
>> for the NFS client to use either privileged ports, or unprivileged
>> ports, but not both, for the same mount point.  Otherwise we could be
>> opening ourselves up for non-deterministic behavior: "How come
>> sometimes I get EPERM when I try to mount my NFS servers, but other
>> times the same mount command works fine?" or "Sometimes after a long
>> idle period my NFS mount points stop working, and all the programs
>> running on the mount point get EACCES."
>>
>> It seems like a good solution would be to:
>>
>> 1.  Make the xprt_minresvport and xprt_maxresvport sysctls mean what
>> they say: they are _reserved_ port limits.  Thus xprt_maxresvport
>> should never be allowed to be larger than 1023, and xprt_minresvport
>> should always be made to be strictly less than xprt_maxresvport; and
>
> That would break existing setups: so, someone googles for "nfs linux
> large numbers of mounts" and comes across:
>
>        http://marc.info/?l=linux-nfs&m=121509091004851&w=2
>
> They add
>
>        echo 2000 >/proc/sys/sunrpc/max_resvport
>
> to their initscripts, and their problem goes away.  A year later, with
> this incident long forgotten, they upgrade their kernel, start getting
> failed mounts, and in the worst case end up debugging the whole problem
> from scratch again.

>> 2.  Introduce a mechanism to specifically enable the NFS client to use
>> non-privileged ports.  It could be a new mount option like "insecure"
>> (which is what some other O/Ses use) or "unpriv-source-port" for
>> example.  I tend to dislike the former because such a feature is
>> likely to be quite useful with Kerberos-authenticated NFS, and
>> "sec=krb5,insecure" is probably a little funny looking, but
>> "sec=krb5,unpriv-source-port" makes it pretty clear what is going on.
>
> But I can see the argument for the mount option.
>
> Maybe we could leave the meaning of the sysctls alone, and allow
> noresvport as an alternate way to allow use of nonreserved ports?
>
> In any case, this all seems a bit orthogonal to the problem of what
> ports the rpcbind client uses, right?

No, this is exactly the original problem.  The reason xprt_maxresvport
is allowed to go larger than 1023 is to permit more NFS mounts.  There
really is no other reason for it I can think of.

But it's broken (or at least inconsistent) behavior that max_resvport
can go past 1023 in the first place.  The name is "max_resvport" --
Maximum Reserved Port.  A port value of 1024 or more is not a
reserved port.  These sysctls are designed to restrict the range of
ports used when a _reserved_ port is requested, not when _any_ source
port is requested.  Trond's suggestion is an "off label" use of this
facility.
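
The distinction being drawn here, expressed as a trivial shell predicate (illustrative only):

```shell
# A reserved (privileged) port is in the range 1..1023;
# 1024 and above are not reserved.
is_reserved_port() {
    [ "$1" -ge 1 ] && [ "$1" -le 1023 ]
}

is_reserved_port 1023 && echo reserved      # prints "reserved"
is_reserved_port 1024 || echo not-reserved  # prints "not-reserved"
```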

And rpcbind isn't the only kernel-level RPC service that requires a
reserved port.  The kernel-level NSM code that calls user space, for
example, is one such service.  In other words, rpcbind isn't the only
service that could potentially hit this issue, so an rpcbind-only fix
would be incomplete.

We already have an appropriate interface for kernel RPC services to
request a non-privileged port.  The NFS client should use that
interface.

Now, we don't have to change both at the same time.  We can introduce
the mount option now; the default reserved port range is still good.
And eventually folks using the sysctl will hit the rpcbind bug (or a
lock recovery problem), trace it back to this issue, and change their
mount options and reset their resvport sysctls.

At some later point, though, the maximum should be restricted to 1023.

>> Such an "insecure" mount option would then set
>> RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the NFS
>> client.
>>
>> I'm not married to the names of the options, or even using a mount
>> option at all (although that seems like a natural place to put such a
>> feature).
>>
>> Thoughts?

--
Chuck Lever


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                                     ` <76bd70e30807301233t73f92775tbdeb3f8efbb34a4f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-07-30 22:01                                       ` Chuck Lever
       [not found]                                         ` <76bd70e30807301501p5c0ba3c6i38fee02a1e606e31-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-07-30 22:13                                       ` J. Bruce Fields
  1 sibling, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-07-30 22:01 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond Myklebust, Trond Myklebust, Carsten Aulbert, linux-nfs,
	Henning Fehrmann, Steffen Grunewald

On Wed, Jul 30, 2008 at 3:33 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> On Wed, Jul 30, 2008 at 1:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>> On Mon, Jul 28, 2008 at 04:55:50PM -0400, Chuck Lever wrote:
>>> On Thu, Jul 17, 2008 at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> > On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> >> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
>>> >>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> >>> > The immediate problem seems like a kernel bug to me--it seems to me that
>>> >>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
>>> >>> > is there some way the daemons can still know that those calls come from
>>> >>> > the local kernel?)
>>> >>>
>>> >>> I tend to agree.  The rpcbind client (at least) does specifically
>>> >>> require a privileged port, so a large min/max port range would be out
>>> >>> of the question for those rpc_clients.
>>> >>
>>> >> Any chance I could talk you into doing a patch for that?
>>> >
>>> > I can look at it when I get back next week.
>>>
>>> I've been pondering this.
>>>
>>> It seems like the NFS client is a rather unique case for using
>>> unprivileged ports; most or all of the other RPC clients in the kernel
>>> want to use privileged ports pretty much all the time, and have
>>> learned to switch this off as needed and appropriate.  We even have an
>>> internal API feature for doing this: the RPC_CLNT_CREATE_NONPRIVPORT
>>> flag to rpc_create().
>>>
>>> And instead of allowing a wide source port range, it would be better
>>> for the NFS client to use either privileged ports, or unprivileged
>>> ports, but not both, for the same mount point.  Otherwise we could be
>>> opening ourselves up for non-deterministic behavior: "How come
>>> sometimes I get EPERM when I try to mount my NFS servers, but other
>>> times the same mount command works fine?" or "Sometimes after a long
>>> idle period my NFS mount points stop working, and all the programs
>>> running on the mount point get EACCES."
>>>
>>> It seems like a good solution would be to:
>>>
>>> 1.  Make the xprt_minresvport and xprt_maxresvport sysctls mean what
>>> they say: they are _reserved_ port limits.  Thus xprt_maxresvport
>>> should never be allowed to be larger than 1023, and xprt_minresvport
>>> should always be made to be strictly less than xprt_maxresvport; and
>>
>> That would break existing setups: so, someone googles for "nfs linux
>> large numbers of mounts" and comes across:
>>
>>        http://marc.info/?l=linux-nfs&m=121509091004851&w=2
>>
>> They add
>>
>>        echo 2000 >/proc/sys/sunrpc/max_resvport
>>
>> to their initscripts, and their problem goes away.  A year later, with
>> this incident long forgotten, they upgrade their kernel, start getting
>> failed mounts, and in the worst case end up debugging the whole problem
>> from scratch again.
>
>>> 2.  Introduce a mechanism to specifically enable the NFS client to use
>>> non-privileged ports.  It could be a new mount option like "insecure"
>>> (which is what some other O/Ses use) or "unpriv-source-port" for
>>> example.  I tend to dislike the former because such a feature is
>>> likely to be quite useful with Kerberos-authenticated NFS, and
>>> "sec=krb5,insecure" is probably a little funny looking, but
>>> "sec=krb5,unpriv-source-port" makes it pretty clear what is going on.
>>
>> But I can see the argument for the mount option.
>>
>> Maybe we could leave the meaning of the sysctls alone, and allow
>> noresvport as an alternate way to use nonreserved ports?
>>
>> In any case, this all seems a bit orthogonal to the problem of what
>> ports the rpcbind client uses, right?
>
> No, this is exactly the original problem.  The reason xprt_maxresvport
> is allowed to go larger than 1023 is to permit more NFS mounts.  There
> really is no other reason for it I can think of.
>
> But it's broken (or at least inconsistent) behavior that max_resvport
> can go past 1023 in the first place.  The name is "max_resvport" --
> Maximum Reserved Port.  A port value of 1024 or greater is not a
> reserved port.  These sysctls are designed to restrict the range of
> ports used when a _reserved_ port is requested, not when _any_ source
> port is requested.  Trond's suggestion is an "off label" use of this
> facility.
>
> And rpcbind isn't the only kernel-level RPC service that requires a
> reserved port.  The kernel-level NSM code that calls user space, for
> example, is one such service.  In other words, rpcbind isn't the only
> service that could potentially hit this issue, so an rpcbind-only fix
> would be incomplete.
>
> We already have an appropriate interface for kernel RPC services to
> request a non-privileged port.  The NFS client should use that
> interface.
>
> Now, we don't have to change both at the same time.  We can introduce
> the mount option now; the default reserved port range is still good.
> And eventually folks using the sysctl will hit the rpcbind bug (or a
> lock recovery problem), trace it back to this issue, and change their
> mount options and reset their resvport sysctls.

Unfortunately we are out of NFS_MOUNT_ flags: there are already 16
defined and this is a legacy kernel ABI, so I'm not sure if we are
allowed to use the upper 16 bits in the flags word.

Will think about this more.

> At some later point, though, the maximum should be restricted to 1023.
>
>>> Such an "insecure" mount option would then set
>>> RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the NFS
>>> client.
>>>
>>> I'm not married to the names of the options, or even using a mount
>>> option at all (although that seems like a natural place to put such a
>>> feature).
>>>
>>> Thoughts?
>
> --
> Chuck Lever
>



-- 
 "Alright guard, begin the unnecessarily slow-moving dipping mechanism."
--Dr. Evil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                                     ` <76bd70e30807301233t73f92775tbdeb3f8efbb34a4f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-07-30 22:01                                       ` Chuck Lever
@ 2008-07-30 22:13                                       ` J. Bruce Fields
  2008-07-31 16:35                                         ` Chuck Lever
  1 sibling, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2008-07-30 22:13 UTC (permalink / raw)
  To: chucklever
  Cc: Trond Myklebust, Trond Myklebust, Carsten Aulbert, linux-nfs,
	Henning Fehrmann, Steffen Grunewald

On Wed, Jul 30, 2008 at 03:33:38PM -0400, Chuck Lever wrote:
> On Wed, Jul 30, 2008 at 1:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > In any case, this all seems a bit orthogonal to the problem of what
> > ports the rpcbind client uses, right?
> 
> No, this is exactly the original problem.  The reason xprt_maxresvport
> is allowed to go larger than 1023 is to permit more NFS mounts.  There
> really is no other reason for it I can think of.
> 
> But it's broken (or at least inconsistent) behavior that max_resvport
> can go past 1023 in the first place.  The name is "max_resvport" --
> Maximum Reserved Port.  A port value of 1024 or greater is not a
> reserved port.  These sysctls are designed to restrict the range of
> ports used when a _reserved_ port is requested, not when _any_ source
> port is requested. Trond's suggestion is an "off label" use of this
> facility.

We could do a better job of communicating what is and isn't a documented
usage, in that case.

Once people are already using an interface a certain way (and because we
told them to) discussions about whether it's really a correct use start
to seem a little academic.

> And rpcbind isn't the only kernel-level RPC service that requires a
> reserved port.  The kernel-level NSM code that calls user space, for
> example, is one such service.  In other words, rpcbind isn't the only
> service that could potentially hit this issue, so an rpcbind-only fix
> would be incomplete.
> 
> We already have an appropriate interface for kernel RPC services to
> request a non-privileged port.  The NFS client should use that
> interface.

I admit that would be nicer.

--b.

> Now, we don't have to change both at the same time.  We can introduce
> the mount option now; the default reserved port range is still good.
> And eventually folks using the sysctl will hit the rpcbind bug (or a
> lock recovery problem), trace it back to this issue, and change their
> mount options and reset their resvport sysctls.
> 
> At some later point, though, the maximum should be restricted to 1023.
> 
> >> Such an "insecure" mount option would then set
> >> RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the NFS
> >> client.
> >>
> >> I'm not married to the names of the options, or even using a mount
> >> option at all (although that seems like a natural place to put such a
> >> feature).
> >>
> >> Thoughts?
> 
> --
> Chuck Lever


* Re: Massive NFS problems on large cluster with large number of mounts
  2008-07-30 22:13                                       ` J. Bruce Fields
@ 2008-07-31 16:35                                         ` Chuck Lever
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2008-07-31 16:35 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: chucklever, Trond Myklebust, Trond Myklebust, Carsten Aulbert,
	linux-nfs, Henning Fehrmann, Steffen Grunewald

On Jul 30, 2008, at 6:13 PM, J. Bruce Fields wrote:
> On Wed, Jul 30, 2008 at 03:33:38PM -0400, Chuck Lever wrote:
>> On Wed, Jul 30, 2008 at 1:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> In any case, this all seems a bit orthogonal to the problem of what
>>> ports the rpcbind client uses, right?
>>
>> No, this is exactly the original problem.  The reason xprt_maxresvport
>> is allowed to go larger than 1023 is to permit more NFS mounts.  There
>> really is no other reason for it I can think of.
>>
>> But it's broken (or at least inconsistent) behavior that max_resvport
>> can go past 1023 in the first place.  The name is "max_resvport" --
>> Maximum Reserved Port.  A port value of 1024 or greater is not a
>> reserved port.  These sysctls are designed to restrict the range of
>> ports used when a _reserved_ port is requested, not when _any_ source
>> port is requested. Trond's suggestion is an "off label" use of this
>> facility.
>
> We could do a better job of communicating what is and isn't a
> documented usage, in that case.
>
> Once people are already using an interface a certain way (and because
> we told them to) discussions about whether it's really a correct use
> start to seem a little academic.

It's not at all academic.

We _must_ revisit interface design whenever a design results in a
kernel paging exception, a privilege escalation or denial of service,
or is simply confusing or uses standard terminology incorrectly.  It
is always appropriate to talk about it.

What we need to be careful about when people are already using an  
interface is how we go about changing it.

>> And rpcbind isn't the only kernel-level RPC service that requires a
>> reserved port.  The kernel-level NSM code that calls user space, for
>> example, is one such service.  In other words, rpcbind isn't the only
>> service that could potentially hit this issue, so an rpcbind-only fix
>> would be incomplete.
>>
>> We already have an appropriate interface for kernel RPC services to
>> request a non-privileged port.  The NFS client should use that
>> interface.
>
> I admit that would be nicer.
>
> --b.
>
>> Now, we don't have to change both at the same time.  We can introduce
>> the mount option now; the default reserved port range is still good.
>> And eventually folks using the sysctl will hit the rpcbind bug (or a
>> lock recovery problem), trace it back to this issue, and change their
>> mount options and reset their resvport sysctls.
>>
>> At some later point, though, the maximum should be restricted to 1023.
>>
>>>> Such an "insecure" mount option would then set
>>>> RPC_CLNT_CREATE_NONPRIVPORT on rpc_clnt's created on behalf of the
>>>> NFS client.
>>>>
>>>> I'm not married to the names of the options, or even using a mount
>>>> option at all (although that seems like a natural place to put such
>>>> a feature).
>>>>
>>>> Thoughts?
>>
>> --
>> Chuck Lever

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                                         ` <76bd70e30807301501p5c0ba3c6i38fee02a1e606e31-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-08-15 20:34                                           ` Chuck Lever
       [not found]                                             ` <76bd70e30808151334i19822280j67a08b92b17582ba-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-08-15 20:34 UTC (permalink / raw)
  To: Trond Myklebust, Trond Myklebust
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Wed, Jul 30, 2008 at 6:01 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> On Wed, Jul 30, 2008 at 3:33 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> On Wed, Jul 30, 2008 at 1:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> On Mon, Jul 28, 2008 at 04:55:50PM -0400, Chuck Lever wrote:
>>>> On Thu, Jul 17, 2008 at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> > On Thu, Jul 17, 2008 at 10:48 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>> >> On Thu, Jul 17, 2008 at 10:47:25AM -0400, Chuck Lever wrote:
>>>> >>> On Wed, Jul 16, 2008 at 3:06 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>> >>> > The immediate problem seems like a kernel bug to me--it seems to me that
>>>> >>> > the calls to local daemons should be ignoring {min_,max}_resvport.  (Or
>>>> >>> > is there some way the daemons can still know that those calls come from
>>>> >>> > the local kernel?)
>>>> >>>
>>>> >>> I tend to agree.  The rpcbind client (at least) does specifically
>>>> >>> require a privileged port, so a large min/max port range would be out
>>>> >>> of the question for those rpc_clients.
>>>> >>
>>>> >> Any chance I could talk you into doing a patch for that?
>>>> >
>>>> > I can look at it when I get back next week.
>>>>
>>>> I've been pondering this.
>>>>
>>>> It seems like the NFS client is a rather unique case for using
>>>> unprivileged ports; most or all of the other RPC clients in the kernel
>>>> want to use privileged ports pretty much all the time, and have
>>>> learned to switch this off as needed and appropriate.  We even have an
>>>> internal API feature for doing this: the RPC_CLNT_CREATE_NONPRIVPORT
>>>> flag to rpc_create().
>>>>
>>>> And instead of allowing a wide source port range, it would be better
>>>> for the NFS client to use either privileged ports, or unprivileged
>>>> ports, but not both, for the same mount point.  Otherwise we could be
>>>> opening ourselves up for non-deterministic behavior: "How come
>>>> sometimes I get EPERM when I try to mount my NFS servers, but other
>>>> times the same mount command works fine?" or "Sometimes after a long
>>>> idle period my NFS mount points stop working, and all the programs
>>>> running on the mount point get EACCES."
>>>>
>>>> It seems like a good solution would be to:
>>>>
>>>> 1.  Make the xprt_minresvport and xprt_maxresvport sysctls mean what
>>>> they say: they are _reserved_ port limits.  Thus xprt_maxresvport
>>>> should never be allowed to be larger than 1023, and xprt_minresvport
>>>> should always be made to be strictly less than xprt_maxresvport; and
>>>
>>> That would break existing setups: so, someone googles for "nfs linux
>>> large numbers of mounts" and comes across:
>>>
>>>        http://marc.info/?l=linux-nfs&m=121509091004851&w=2
>>>
>>> They add
>>>
>>>        echo 2000 >/proc/sys/sunrpc/max_resvport
>>>
>>> to their initscripts, and their problem goes away.  A year later, with
>>> this incident long forgotten, they upgrade their kernel, start getting
>>> failed mounts, and in the worst case end up debugging the whole problem
>>> from scratch again.
>>
>>>> 2.  Introduce a mechanism to specifically enable the NFS client to use
>>>> non-privileged ports.  It could be a new mount option like "insecure"
>>>> (which is what some other O/Ses use) or "unpriv-source-port" for
>>>> example.  I tend to dislike the former because such a feature is
>>>> likely to be quite useful with Kerberos-authenticated NFS, and
>>>> "sec=krb5,insecure" is probably a little funny looking, but
>>>> "sec=krb5,unpriv-source-port" makes it pretty clear what is going on.
>>>
>>> But I can see the argument for the mount option.
>>>
>>> Maybe we could leave the meaning of the sysctls alone, and allow
>>> noresvport as an alternate way to use nonreserved ports?
>>>
>>> In any case, this all seems a bit orthogonal to the problem of what
>>> ports the rpcbind client uses, right?
>>
>> No, this is exactly the original problem.  The reason xprt_maxresvport
>> is allowed to go larger than 1023 is to permit more NFS mounts.  There
>> really is no other reason for it I can think of.
>>
>> But it's broken (or at least inconsistent) behavior that max_resvport
>> can go past 1023 in the first place.  The name is "max_resvport" --
>> Maximum Reserved Port.  A port value of 1024 or greater is not a
>> reserved port.  These sysctls are designed to restrict the range of
>> ports used when a _reserved_ port is requested, not when _any_ source
>> port is requested.  Trond's suggestion is an "off label" use of this
>> facility.
>>
>> And rpcbind isn't the only kernel-level RPC service that requires a
>> reserved port.  The kernel-level NSM code that calls user space, for
>> example, is one such service.  In other words, rpcbind isn't the only
>> service that could potentially hit this issue, so an rpcbind-only fix
>> would be incomplete.
>>
>> We already have an appropriate interface for kernel RPC services to
>> request a non-privileged port.  The NFS client should use that
>> interface.
>>
>> Now, we don't have to change both at the same time.  We can introduce
>> the mount option now; the default reserved port range is still good.
>> And eventually folks using the sysctl will hit the rpcbind bug (or a
>> lock recovery problem), trace it back to this issue, and change their
>> mount options and reset their resvport sysctls.
>
> Unfortunately we are out of NFS_MOUNT_ flags: there are already 16
> defined and this is a legacy kernel ABI, so I'm not sure if we are
> allowed to use the upper 16 bits in the flags word.
>
> Will think about this more.

We had some discussion about this at the pub last night.

Trond, NFS_MOUNT_FLAGMASK is used in nfs_init_server() and
nfs4_init_server() for both legacy binary and text-based mounts.  This
needs to be moved to a legacy-only path if we want to use the
high-order 16 bits in the 'flags' field for text-based mounts.

I reviewed the Solaris mount_nfs(1M) man page (I hope this is the
correct place to look).  There doesn't appear to be a mount option to
make Solaris NFS clients use a reserved port. Not sure if there's some
other UI (like a config file in /etc).

FreeBSD and Mac OS both use "[no]resvport" as Mike pointed out
earlier.  That's my vote for the new Linux mount option.

[ Sidebar: I found this in the Mac OS mount_nfs(8) man page:

     noconn  Do not connect UDP sockets.  For UDP mount points, do not do a
             connect(2).  This must be used for servers that do not reply to
             requests from the standard NFS port number 2049.  It may also be
             required for servers with more than one IP address if replies come
             from an address other than the one specified in the requests.

An interesting consideration if we support connected UDP sockets for
NFS at some point. ]

-- 
"Officer. Ma'am. Squeaker."
  -- Mr. Incredible


* Re: Massive NFS problems on large cluster with large number of mounts
       [not found]                                             ` <76bd70e30808151334i19822280j67a08b92b17582ba-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-08-15 20:47                                               ` Trond Myklebust
  2008-08-15 21:04                                                 ` Trond Myklebust
  0 siblings, 1 reply; 29+ messages in thread
From: Trond Myklebust @ 2008-08-15 20:47 UTC (permalink / raw)
  To: chucklever
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Fri, 2008-08-15 at 16:34 -0400, Chuck Lever wrote:
> Trond, NFS_MOUNT_FLAGMASK is used in nfs_init_server() and
> nfs4_init_server() for both legacy binary and text-based mounts.  This
> needs to be moved to a legacy-only path if we want to use the
> high-order 16 bits in the 'flags' field for text-based mounts.

We definitely want to do this. The point of introducing text-based
mounts was to allow us to add functionality without having to worry
about legacy binary mount formats. The mask should be there in order to
ensure that binary formats don't start enabling features that they
cannot support. There is no justification for applying it to the text
mount path.

> I reviewed the Solaris mount_nfs(1M) man page (I hope this is the
> correct place to look).  There doesn't appear to be a mount option to
> make Solaris NFS clients use a reserved port. Not sure if there's some
> other UI (like a config file in /etc).
> 
> FreeBSD and Mac OS both use "[no]resvport" as Mike pointed out
> earlier.  That's my vote for the new Linux mount option.

Agreed: we should try to follow the standard set by existing
implementations wherever we can...

> [ Sidebar: I found this in the Mac OS mount_nfs(8) man page:
> 
>      noconn  Do not connect UDP sockets.  For UDP mount points, do not do a
>              connect(2).  This must be used for servers that do not reply to
>              requests from the standard NFS port number 2049.  It may also be
>              required for servers with more than one IP address if replies come
>              from an address other than the one specified in the requests.
> 
> An interesting consideration if we support connected UDP sockets for
> NFS at some point. ]

Hmm... Well, we already don't support servers that reply to a UDP
request from a different IP address, and I can't see that we should
really care. Aside from the fact that most clients will use TCP by
default these days, it is quite trivial for a server to track on which
interface a UDP request was received, and ensure that the reply is sent
on the same interface. In fact, we already do this in the Linux server
AFAICR...

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


* Re: Massive NFS problems on large cluster with large number of mounts
  2008-08-15 20:47                                               ` Trond Myklebust
@ 2008-08-15 21:04                                                 ` Trond Myklebust
  2008-08-15 21:39                                                   ` Chuck Lever
  0 siblings, 1 reply; 29+ messages in thread
From: Trond Myklebust @ 2008-08-15 21:04 UTC (permalink / raw)
  To: chucklever
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

[-- Attachment #1: Type: text/plain, Size: 898 bytes --]

On Fri, 2008-08-15 at 16:47 -0400, Trond Myklebust wrote:
> On Fri, 2008-08-15 at 16:34 -0400, Chuck Lever wrote:
> > Trond, NFS_MOUNT_FLAGMASK is used in nfs_init_server() and
> > nfs4_init_server() for both legacy binary and text-based mounts.  This
> > needs to be moved to a legacy-only path if we want to use the
> > high-order 16 bits in the 'flags' field for text-based mounts.
> 
> We definitely want to do this. The point of introducing text-based
> mounts was to allow us to add functionality without having to worry
> about legacy binary mount formats. The mask should be there in order to
> ensure that binary formats don't start enabling features that they
> cannot support. There is no justification for applying it to the text
> mount path.

I've attached the patch...

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

[-- Attachment #2: linux-2.6.27-005-dont_apply_nfs_mount_flagmask_to_text_mounts.dif --]
[-- Type: message/rfc822, Size: 1938 bytes --]

From: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: No Subject
Date: Fri, 15 Aug 2008 16:59:14 -0400
Message-ID: <1218834252.7037.32.camel@localhost>

The point of introducing text-based mounts was to allow us to add
functionality without having to worry about legacy binary mount formats.
The mask should be there in order to ensure that binary formats don't start
enabling features that they cannot support. There is no justification for
applying it to the text mount path.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 fs/nfs/client.c |    4 ++--
 fs/nfs/super.c  |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 2accb67..7547600 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -675,7 +675,7 @@ static int nfs_init_server(struct nfs_server *server,
 	server->nfs_client = clp;
 
 	/* Initialise the client representation from the mount data */
-	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+	server->flags = data->flags;
 
 	if (data->rsize)
 		server->rsize = nfs_block_size(data->rsize, NULL);
@@ -1072,7 +1072,7 @@ static int nfs4_init_server(struct nfs_server *server,
 		goto error;
 
 	/* Initialise the client representation from the mount data */
-	server->flags = data->flags & NFS_MOUNT_FLAGMASK;
+	server->flags = data->flags;
 	server->caps |= NFS_CAP_ATOMIC_OPEN;
 
 	if (data->rsize)
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index f67d44c..5725af9 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1544,7 +1544,7 @@ static int nfs_validate_mount_data(void *options,
 		 * Translate to nfs_parsed_mount_data, which nfs_fill_super
 		 * can deal with.
 		 */
-		args->flags		= data->flags;
+		args->flags		= data->flags & NFS_MOUNT_FLAGMASK;
 		args->rsize		= data->rsize;
 		args->wsize		= data->wsize;
 		args->timeo		= data->timeo;


* Re: Massive NFS problems on large cluster with large number of mounts
  2008-08-15 21:04                                                 ` Trond Myklebust
@ 2008-08-15 21:39                                                   ` Chuck Lever
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2008-08-15 21:39 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Carsten Aulbert, linux-nfs, Henning Fehrmann, Steffen Grunewald

On Fri, Aug 15, 2008 at 5:04 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2008-08-15 at 16:47 -0400, Trond Myklebust wrote:
>> On Fri, 2008-08-15 at 16:34 -0400, Chuck Lever wrote:
>> > Trond, NFS_MOUNT_FLAGMASK is used in nfs_init_server() and
>> > nfs4_init_server() for both legacy binary and text-based mounts.  This
>> > needs to be moved to a legacy-only path if we want to use the
>> > high-order 16 bits in the 'flags' field for text-based mounts.
>>
>> We definitely want to do this. The point of introducing text-based
>> mounts was to allow us to add functionality without having to worry
>> about legacy binary mount formats. The mask should be there in order to
>> ensure that binary formats don't start enabling features that they
>> cannot support. There is no justification for applying it to the text
>> mount path.
>
> I've attached the patch...

Thanks!

-- 
"Officer. Ma'am. Squeaker."
  -- Mr. Incredible


end of thread, other threads:[~2008-08-15 21:39 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-01  8:19 Massive NFS problems on large cluster with large number of mounts Carsten Aulbert
     [not found] ` <4869E8AB.4060905-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
2008-07-01 18:22   ` J. Bruce Fields
2008-07-01 18:26     ` J. Bruce Fields
2008-07-02 14:00     ` Carsten Aulbert
     [not found]       ` <486B89F5.9000109-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
2008-07-02 20:31         ` J. Bruce Fields
2008-07-02 21:04           ` Trond Myklebust
2008-07-02 21:08             ` J. Bruce Fields
2008-07-03  5:31             ` Carsten Aulbert
     [not found]               ` <486C642B.3020100-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
2008-07-03 12:35                 ` Carsten Aulbert
2008-07-16  9:49             ` Carsten Aulbert
     [not found]               ` <487DC43F.8040408-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
2008-07-16 19:06                 ` J. Bruce Fields
2008-07-17  5:53                   ` Carsten Aulbert
     [not found]                     ` <487EDE57.4070100-l1a6w7hxd2yELgA04lAiVw@public.gmane.org>
2008-07-17 14:27                       ` J. Bruce Fields
2008-07-17 14:47                   ` Chuck Lever
     [not found]                     ` <76bd70e30807170747r31af3280icf0bd3fdbde17bac-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-17 14:48                       ` J. Bruce Fields
2008-07-17 15:11                         ` Chuck Lever
     [not found]                           ` <76bd70e30807170811s78175c0ep3a52da7c0ef95fc6-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-28 20:55                             ` Chuck Lever
     [not found]                               ` <76bd70e30807281355t4890a9b2q6960d79552538f60-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-29 11:32                                 ` Jeff Layton
     [not found]                                   ` <20080729073203.546a4269-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-07-29 17:43                                     ` Mike Mackovitch
2008-07-30 17:53                                 ` J. Bruce Fields
2008-07-30 19:33                                   ` Chuck Lever
     [not found]                                     ` <76bd70e30807301233t73f92775tbdeb3f8efbb34a4f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-30 22:01                                       ` Chuck Lever
     [not found]                                         ` <76bd70e30807301501p5c0ba3c6i38fee02a1e606e31-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-08-15 20:34                                           ` Chuck Lever
     [not found]                                             ` <76bd70e30808151334i19822280j67a08b92b17582ba-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-08-15 20:47                                               ` Trond Myklebust
2008-08-15 21:04                                                 ` Trond Myklebust
2008-08-15 21:39                                                   ` Chuck Lever
2008-07-30 22:13                                       ` J. Bruce Fields
2008-07-31 16:35                                         ` Chuck Lever
2008-07-17 15:35                       ` Trond Myklebust
