* [nfsd PATCH 0/3] address issues with shutdown while portmap is dead
@ 2009-05-28 6:33 NeilBrown
[not found] ` <20090528062730.15937.70579.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2009-05-28 6:33 UTC (permalink / raw)
To: linux-nfs
Hi all.
Currently if you shut down nfsd while portmap is dead, it takes about
3.5 minutes. This isn't ideal.
The following patch series addresses some of the issues surrounding that.
The first two are minor fixes to nfsd which get it to behave a little
more consistently. They don't directly affect the issue.
The third changes the timeouts used when rpcb_client talks to
portmap or rpcbind to unregister a service. It simply reduces the
timeout to 0.5 seconds with no retries.
As there should be no packet loss, this should only fail if
portmap/rpcbind isn't working. And in that case unregistering
won't serve any purpose anyway.
And as unregistering is merely for cleanliness, not for correctness,
there wouldn't be a problem even if something did go wrong.
With this patch, the 3.5 minutes drops to 3 seconds.
Comments welcome.
Thanks,
NeilBrown
---
NeilBrown (3):
sunrpc: reduce timeout when unregistering rpcbind registrations.
nfsd: optimise the starting of zero threads when none are running.
nfsd: don't take nfsd_mutex twice when setting number of threads.
fs/nfsd/nfsctl.c | 13 +++++++++----
fs/nfsd/nfssvc.c | 8 ++++++--
net/sunrpc/rpcb_clnt.c | 17 +++++++++++++++--
3 files changed, 30 insertions(+), 8 deletions(-)
--
Signature
^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <20090528062730.15937.70579.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* [PATCH 1/3] nfsd: don't take nfsd_mutex twice when setting number of threads. [not found] ` <20090528062730.15937.70579.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-05-28 6:33 ` NeilBrown [not found] ` <20090528063303.15937.55202.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2009-05-28 6:33 ` [PATCH 2/3] nfsd: optimise the starting of zero threads when none are running NeilBrown 2009-05-28 6:33 ` [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations NeilBrown 2 siblings, 1 reply; 25+ messages in thread From: NeilBrown @ 2009-05-28 6:33 UTC (permalink / raw) To: linux-nfs; +Cc: NeilBrown
Currently when we write a number to 'threads' in nfsdfs, we take the nfsd_mutex, update the number of threads, then take the mutex again to read the number of threads. Mostly this isn't a big deal. However if we write '0', and portmap happens to be dead, then we can get unpredictable behaviour. If the nfsd threads all got killed quickly and the last thread is waiting for portmap to respond, then the second time we take the mutex we will block waiting for the last thread. However if the nfsd threads didn't die quite that fast, then there will be no contention when we try to take the mutex again. Unpredictability isn't fun, and waiting for the last thread to exit is pointless, so avoid taking the lock twice. To achieve this, get nfsd_svc to return a non-negative number of active threads when not returning a negative error.
Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfsctl.c | 13 +++++++++---- fs/nfsd/nfssvc.c | 2 ++ 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index af16849..1ddd4f9 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -207,10 +207,14 @@ static struct file_operations pool_stats_operations = { static ssize_t write_svc(struct file *file, char *buf, size_t size) { struct nfsctl_svc *data; + int err; if (size < sizeof(*data)) return -EINVAL; data = (struct nfsctl_svc*) buf; - return nfsd_svc(data->svc_port, data->svc_nthreads); + err = nfsd_svc(data->svc_port, data->svc_nthreads); + if (err < 0) + return err; + return 0; } /** @@ -692,10 +696,11 @@ static ssize_t write_threads(struct file *file, char *buf, size_t size) if (newthreads < 0) return -EINVAL; rv = nfsd_svc(NFS_PORT, newthreads); - if (rv) + if (rv < 0) return rv; - } - sprintf(buf, "%d\n", nfsd_nrthreads()); + } else + rv = nfsd_nrthreads(); + sprintf(buf, "%d\n", rv); return strlen(buf); } diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c index cbba4a9..df6c59e 100644 --- a/fs/nfsd/nfssvc.c +++ b/fs/nfsd/nfssvc.c @@ -413,6 +413,8 @@ nfsd_svc(unsigned short port, int nrservs) goto failure; error = svc_set_num_threads(nfsd_serv, NULL, nrservs); + if (error == 0) + error = nfsd_serv->sv_nrthreads - 1; failure: svc_destroy(nfsd_serv); /* Release server */ out: ^ permalink raw reply related [flat|nested] 25+ messages in thread
[parent not found: <20090528063303.15937.55202.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [PATCH 1/3] nfsd: don't take nfsd_mutex twice when setting number of threads. [not found] ` <20090528063303.15937.55202.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-06-11 17:52 ` Jeff Layton [not found] ` <20090611135255.0fa2f728-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org> 0 siblings, 1 reply; 25+ messages in thread From: Jeff Layton @ 2009-06-11 17:52 UTC (permalink / raw) To: NeilBrown; +Cc: linux-nfs On Thu, 28 May 2009 16:33:03 +1000 NeilBrown <neilb@suse.de> wrote: > Currently when we right a number to 'threads' in nfsdfs, > we take the nfsd_mutex, update the number of threads, then take the > mutex again to read the number of threads. > > Mostly this isn't a big deal. However if we are write '0', and > portmap happens to be dead, then we can get unpredictable behaviour. > If the nfsd threads all got killed quickly and the last thread is > waiting for portmap to respond, then the second time we take the mutex > we will block waiting for the last thread. > However if the nfsd threads didn't die quite that fast, then there > will be no contention when we try to take the mutex again. > > Unpredictability isn't fun, and waiting for the last thread to exit is > pointless, so avoid taking the lock twice. > To achieve this, get nfsd_svc return a non-negative number of active > threads when not returning a negative error. 
> > Signed-off-by: NeilBrown <neilb@suse.de> > --- > > fs/nfsd/nfsctl.c | 13 +++++++++---- > fs/nfsd/nfssvc.c | 2 ++ > 2 files changed, 11 insertions(+), 4 deletions(-) > > diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > index af16849..1ddd4f9 100644 > --- a/fs/nfsd/nfsctl.c > +++ b/fs/nfsd/nfsctl.c > @@ -207,10 +207,14 @@ static struct file_operations pool_stats_operations = { > static ssize_t write_svc(struct file *file, char *buf, size_t size) > { > struct nfsctl_svc *data; > + int err; > if (size < sizeof(*data)) > return -EINVAL; > data = (struct nfsctl_svc*) buf; > - return nfsd_svc(data->svc_port, data->svc_nthreads); > + err = nfsd_svc(data->svc_port, data->svc_nthreads); > + if (err < 0) > + return err; > + return 0; > } > > /** > @@ -692,10 +696,11 @@ static ssize_t write_threads(struct file *file, char *buf, size_t size) > if (newthreads < 0) > return -EINVAL; > rv = nfsd_svc(NFS_PORT, newthreads); > - if (rv) > + if (rv < 0) > return rv; > - } > - sprintf(buf, "%d\n", nfsd_nrthreads()); > + } else > + rv = nfsd_nrthreads(); > + sprintf(buf, "%d\n", rv); > return strlen(buf); > } > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c > index cbba4a9..df6c59e 100644 > --- a/fs/nfsd/nfssvc.c > +++ b/fs/nfsd/nfssvc.c > @@ -413,6 +413,8 @@ nfsd_svc(unsigned short port, int nrservs) > goto failure; > > error = svc_set_num_threads(nfsd_serv, NULL, nrservs); > + if (error == 0) > + error = nfsd_serv->sv_nrthreads - 1; ^^^^ Why -1 here? I thought you wanted nfsd_svc to return the number of threads? Also, looks like Bruce's tree has changed a bit recently and I don't think this patch will apply cleanly now. You may need to respin... 
> failure: > svc_destroy(nfsd_serv); /* Release server */ > out: > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <20090611135255.0fa2f728-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org>]
* Re: [PATCH 1/3] nfsd: don't take nfsd_mutex twice when setting number of threads. [not found] ` <20090611135255.0fa2f728-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org> @ 2009-06-11 18:01 ` Jeff Layton 0 siblings, 0 replies; 25+ messages in thread From: Jeff Layton @ 2009-06-11 18:01 UTC (permalink / raw) To: NeilBrown; +Cc: linux-nfs On Thu, 11 Jun 2009 13:52:55 -0400 Jeff Layton <jlayton@redhat.com> wrote: > On Thu, 28 May 2009 16:33:03 +1000 > NeilBrown <neilb@suse.de> wrote: > > > Currently when we right a number to 'threads' in nfsdfs, > > we take the nfsd_mutex, update the number of threads, then take the > > mutex again to read the number of threads. > > > > Mostly this isn't a big deal. However if we are write '0', and > > portmap happens to be dead, then we can get unpredictable behaviour. > > If the nfsd threads all got killed quickly and the last thread is > > waiting for portmap to respond, then the second time we take the mutex > > we will block waiting for the last thread. > > However if the nfsd threads didn't die quite that fast, then there > > will be no contention when we try to take the mutex again. > > > > Unpredictability isn't fun, and waiting for the last thread to exit is > > pointless, so avoid taking the lock twice. > > To achieve this, get nfsd_svc return a non-negative number of active > > threads when not returning a negative error. 
> > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > > > fs/nfsd/nfsctl.c | 13 +++++++++---- > > fs/nfsd/nfssvc.c | 2 ++ > > 2 files changed, 11 insertions(+), 4 deletions(-) > > > > diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c > > index af16849..1ddd4f9 100644 > > --- a/fs/nfsd/nfsctl.c > > +++ b/fs/nfsd/nfsctl.c > > @@ -207,10 +207,14 @@ static struct file_operations pool_stats_operations = { > > static ssize_t write_svc(struct file *file, char *buf, size_t size) > > { > > struct nfsctl_svc *data; > > + int err; > > if (size < sizeof(*data)) > > return -EINVAL; > > data = (struct nfsctl_svc*) buf; > > - return nfsd_svc(data->svc_port, data->svc_nthreads); > > + err = nfsd_svc(data->svc_port, data->svc_nthreads); > > + if (err < 0) > > + return err; > > + return 0; > > } > > > > /** > > @@ -692,10 +696,11 @@ static ssize_t write_threads(struct file *file, char *buf, size_t size) > > if (newthreads < 0) > > return -EINVAL; > > rv = nfsd_svc(NFS_PORT, newthreads); > > - if (rv) > > + if (rv < 0) > > return rv; > > - } > > - sprintf(buf, "%d\n", nfsd_nrthreads()); > > + } else > > + rv = nfsd_nrthreads(); > > + sprintf(buf, "%d\n", rv); > > return strlen(buf); > > } > > > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c > > index cbba4a9..df6c59e 100644 > > --- a/fs/nfsd/nfssvc.c > > +++ b/fs/nfsd/nfssvc.c > > @@ -413,6 +413,8 @@ nfsd_svc(unsigned short port, int nrservs) > > goto failure; > > > > error = svc_set_num_threads(nfsd_serv, NULL, nrservs); > > + if (error == 0) > > + error = nfsd_serv->sv_nrthreads - 1; > ^^^^ > Why -1 here? I thought you wanted nfsd_svc to return the number of > threads? > Ahh nm...I see why. sv_nrthreads is artificially high due to the earlier svc_get. Hate the refcounting with this stuff -- blech... > Also, looks like Bruce's tree has changed a bit recently and I don't > think this patch will apply cleanly now. You may need to respin... 
> > > failure: > > svc_destroy(nfsd_serv); /* Release server */ > > out: > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- > Jeff Layton <jlayton@redhat.com> > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/3] nfsd: optimise the starting of zero threads when none are running. [not found] ` <20090528062730.15937.70579.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2009-05-28 6:33 ` [PATCH 1/3] nfsd: don't take nfsd_mutex twice when setting number of threads NeilBrown @ 2009-05-28 6:33 ` NeilBrown [not found] ` <20090528063303.15937.57966.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2009-05-28 6:33 ` [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations NeilBrown 2 siblings, 1 reply; 25+ messages in thread From: NeilBrown @ 2009-05-28 6:33 UTC (permalink / raw) To: linux-nfs; +Cc: NeilBrown
Currently, if we ask to set the number of nfsd threads to zero when there are none running, we set up all the sockets and register the service, and then tear it all down again. This is pointless. So detect that case and exit promptly. (Also remove an assignment to 'error' which was never used.) Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/nfssvc.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c index df6c59e..83b7d41 100644 --- a/fs/nfsd/nfssvc.c +++ b/fs/nfsd/nfssvc.c @@ -390,12 +390,14 @@ nfsd_svc(unsigned short port, int nrservs) mutex_lock(&nfsd_mutex); dprintk("nfsd: creating service\n"); - error = -EINVAL; if (nrservs <= 0) nrservs = 0; if (nrservs > NFSD_MAXSERVS) nrservs = NFSD_MAXSERVS; - + error = 0; + if (nrservs == 0 && nfsd_serv == NULL) + goto out; + /* Readahead param cache - will no-op if it already exists */ error = nfsd_racache_init(2*nrservs); if (error<0) ^ permalink raw reply related [flat|nested] 25+ messages in thread
[parent not found: <20090528063303.15937.57966.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [PATCH 2/3] nfsd: optimise the starting of zero threads when none are running. [not found] ` <20090528063303.15937.57966.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-06-11 17:26 ` Jeff Layton 0 siblings, 0 replies; 25+ messages in thread From: Jeff Layton @ 2009-06-11 17:26 UTC (permalink / raw) To: NeilBrown; +Cc: linux-nfs, NeilBrown On Thu, 28 May 2009 16:33:03 +1000 NeilBrown <neilb@suse.de> wrote: > Currently, if we ask to set then number of nfsd threads to zero when > there are none running, we set up all the sockets and register the > service, and then tear it all down again. > This is pointless. > > So detect that case and exit promptly. > (also remove an assignment to 'error' which was never used. > > Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@redhat.com> FWIW, the patchset I did to add IPv6 support makes some similar optimizations to the userspace rpc.nfsd program too. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. [not found] ` <20090528062730.15937.70579.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2009-05-28 6:33 ` [PATCH 1/3] nfsd: don't take nfsd_mutex twice when setting number of threads NeilBrown 2009-05-28 6:33 ` [PATCH 2/3] nfsd: optimise the starting of zero threads when none are running NeilBrown @ 2009-05-28 6:33 ` NeilBrown [not found] ` <20090528063303.15937.62423.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2 siblings, 1 reply; 25+ messages in thread From: NeilBrown @ 2009-05-28 6:33 UTC (permalink / raw) To: linux-nfs; +Cc: NeilBrown
Unregistering an RPC service is not essential - but it is tidy. So it is unpleasant to wait a long time for it to complete. As unregistering is always directed to localhost, the most likely reason for any delay is that portmap (or rpcbind) is not running. In that case, waiting is totally pointless. In any case, we should expect a quick response, and zero packet loss. So reduce the timeouts to a total of half a second. This means that if we try to stop nfsd while portmap is not running, it takes 3 seconds instead of 3.5 minutes. Note that stopping portmap before the last nfsd thread has died is not simply a configuration error. When we kill nfsd threads they could take a while to die, and it is possible that portmap could then be killed before the last nfsd thread has had a chance to finish up. [An alternative might be to make the sunrpc code always "connect" udp sockets so that "port not reachable" errors would get reported back.
This requires a more intrusive change though and might have other consequences] Signed-off-by: NeilBrown <neilb@suse.de> --- net/sunrpc/rpcb_clnt.c | 17 +++++++++++++++-- 1 files changed, 15 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c index beee6da..31f7e2e 100644 --- a/net/sunrpc/rpcb_clnt.c +++ b/net/sunrpc/rpcb_clnt.c @@ -132,7 +132,8 @@ static const struct sockaddr_in rpcb_inaddr_loopback = { }; static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr, - size_t addrlen, u32 version) + size_t addrlen, u32 version, + int unregister) { struct rpc_create_args args = { .protocol = XPRT_TRANSPORT_UDP, @@ -144,6 +145,16 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr, .authflavor = RPC_AUTH_UNIX, .flags = RPC_CLNT_CREATE_NOPING, }; + const struct rpc_timeout unreg_timeout = { + .to_initval = HZ/2, + .to_maxval = HZ/2, + .to_increment = 0, + .to_retries = 0, + .to_exponential = 0, + }; + + if (unregister) + args.timeout = &unreg_timeout; return rpc_create(&args); } @@ -183,10 +194,12 @@ static int rpcb_register_call(const u32 version, struct rpc_message *msg) size_t addrlen = sizeof(rpcb_inaddr_loopback); struct rpc_clnt *rpcb_clnt; int result, error = 0; + int unregister; msg->rpc_resp = &result; - rpcb_clnt = rpcb_create_local(addr, addrlen, version); + unregister = (msg->rpc_proc->p_proc == RPCBPROC_UNSET); + rpcb_clnt = rpcb_create_local(addr, addrlen, version, unregister); if (!IS_ERR(rpcb_clnt)) { error = rpc_call_sync(rpcb_clnt, msg, 0); rpc_shutdown_client(rpcb_clnt); ^ permalink raw reply related [flat|nested] 25+ messages in thread
[parent not found: <20090528063303.15937.62423.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. [not found] ` <20090528063303.15937.62423.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-05-28 13:07 ` Tom Talpey 2009-06-11 4:49 ` Neil Brown 2009-05-28 13:43 ` Chuck Lever 1 sibling, 1 reply; 25+ messages in thread From: Tom Talpey @ 2009-05-28 13:07 UTC (permalink / raw) To: NeilBrown; +Cc: linux-nfs At 02:33 AM 5/28/2009, NeilBrown wrote: >Unregistering an RPC service is not essential - but it is tidy. >So it is unpleasant to wait a long time for it to complete. > >As unregistering is always directed to localhost, the most likely >reason for any delay is the portmap (or rpcbind) is not running. >In that case, waiting it totally pointless. In any case, we should >expect a quick response, and zero packet loss. > >So reduce the timeouts to a total of half a second. Why wait for the response at all in this case? With zero retries and nothing to do with the result, it seems pointless to even wait for half a second. Tom. > >This means that if we try to stop nfsd while portmap is not running, >it takes 3 seconds instead of 3.5 minutes. > >Note that stopping portmap before the last nfsd thread has died is >not simply a configuration error. When we kill nfsd threads they >could take a while to die, and it is possible that portmap could >then be killed before the last nfsd thread has had a change to finish >up. > >[An alternate might be to make the sunrpc code always "connect" >udp sockets so that "port not reachable" errors would get reported >back. 
This requires a more intrusive change though and might have >other consequences] > >Signed-off-by: NeilBrown <neilb@suse.de> >--- > > net/sunrpc/rpcb_clnt.c | 17 +++++++++++++++-- > 1 files changed, 15 insertions(+), 2 deletions(-) > >diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c >index beee6da..31f7e2e 100644 >--- a/net/sunrpc/rpcb_clnt.c >+++ b/net/sunrpc/rpcb_clnt.c >@@ -132,7 +132,8 @@ static const struct sockaddr_in rpcb_inaddr_loopback = { > }; > > static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr, >- size_t addrlen, u32 version) >+ size_t addrlen, u32 version, >+ int unregister) > { > struct rpc_create_args args = { > .protocol = XPRT_TRANSPORT_UDP, >@@ -144,6 +145,16 @@ static struct rpc_clnt *rpcb_create_local(struct >sockaddr *addr, > .authflavor = RPC_AUTH_UNIX, > .flags = RPC_CLNT_CREATE_NOPING, > }; >+ const struct rpc_timeout unreg_timeout = { >+ .to_initval = HZ/2, >+ .to_maxval = HZ/2, >+ .to_increment = 0, >+ .to_retries = 0, >+ .to_exponential = 0, >+ }; >+ >+ if (unregister) >+ args.timeout = &unreg_timeout; > > return rpc_create(&args); > } >@@ -183,10 +194,12 @@ static int rpcb_register_call(const u32 version, >struct rpc_message *msg) > size_t addrlen = sizeof(rpcb_inaddr_loopback); > struct rpc_clnt *rpcb_clnt; > int result, error = 0; >+ int unregister; > > msg->rpc_resp = &result; > >- rpcb_clnt = rpcb_create_local(addr, addrlen, version); >+ unregister = (msg->rpc_proc->p_proc == RPCBPROC_UNSET); >+ rpcb_clnt = rpcb_create_local(addr, addrlen, version, unregister); > if (!IS_ERR(rpcb_clnt)) { > error = rpc_call_sync(rpcb_clnt, msg, 0); > rpc_shutdown_client(rpcb_clnt); > > >-- >To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-05-28 13:07 ` Tom Talpey @ 2009-06-11 4:49 ` Neil Brown [not found] ` <18992.36038.267957.467326-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 0 siblings, 1 reply; 25+ messages in thread From: Neil Brown @ 2009-06-11 4:49 UTC (permalink / raw) To: Tom Talpey; +Cc: linux-nfs On Thursday May 28, tmtalpey@gmail.com wrote: > At 02:33 AM 5/28/2009, NeilBrown wrote: > >Unregistering an RPC service is not essential - but it is tidy. > >So it is unpleasant to wait a long time for it to complete. > > > >As unregistering is always directed to localhost, the most likely > >reason for any delay is the portmap (or rpcbind) is not running. > >In that case, waiting it totally pointless. In any case, we should > >expect a quick response, and zero packet loss. > > > >So reduce the timeouts to a total of half a second. > > Why wait for the response at all in this case? With zero retries > and nothing to do with the result, it seems pointless to even wait > for half a second. That's a fair point, thanks. I'll keep that in mind if I revisit these patches. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <18992.36038.267957.467326-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. [not found] ` <18992.36038.267957.467326-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-06-11 15:02 ` Chuck Lever 0 siblings, 0 replies; 25+ messages in thread From: Chuck Lever @ 2009-06-11 15:02 UTC (permalink / raw) To: Neil Brown, Tom Talpey; +Cc: Linux NFS Mailing list On Jun 11, 2009, at 12:49 AM, Neil Brown wrote: > On Thursday May 28, tmtalpey@gmail.com wrote: >> At 02:33 AM 5/28/2009, NeilBrown wrote: >>> Unregistering an RPC service is not essential - but it is tidy. >>> So it is unpleasant to wait a long time for it to complete. >>> >>> As unregistering is always directed to localhost, the most likely >>> reason for any delay is the portmap (or rpcbind) is not running. >>> In that case, waiting it totally pointless. In any case, we should >>> expect a quick response, and zero packet loss. >>> >>> So reduce the timeouts to a total of half a second. >> >> Why wait for the response at all in this case? With zero retries >> and nothing to do with the result, it seems pointless to even wait >> for half a second. > > That's a fair point, thanks. I'll keep that in mind if I revisit > these patches. While not waiting is probably OK in the "shutdown" case, it could be a problem during normal operation. The kernel unregisters RPC services before it attempts to register them (to clear any old registrations). Not waiting for the unregister reply might be racy. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. [not found] ` <20090528063303.15937.62423.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2009-05-28 13:07 ` Tom Talpey @ 2009-05-28 13:43 ` Chuck Lever 2009-06-11 4:48 ` Neil Brown 1 sibling, 1 reply; 25+ messages in thread From: Chuck Lever @ 2009-05-28 13:43 UTC (permalink / raw) To: NeilBrown; +Cc: linux-nfs On May 28, 2009, at 2:33 AM, NeilBrown wrote: > Unregistering an RPC service is not essential - but it is tidy. > So it is unpleasant to wait a long time for it to complete. > > As unregistering is always directed to localhost, the most likely > reason for any delay is the portmap (or rpcbind) is not running. > In that case, waiting it totally pointless. In any case, we should > expect a quick response, and zero packet loss. > > So reduce the timeouts to a total of half a second. A much better solution here would be to use a connected UDP socket. This would make start-up in the absence of rpcbind/portmap fail much quicker, we could keep the longer, safer timeout value, and the kernel could warn about exactly what the problem is. It would also have similar benefits for the kernel's mount and NSM clients, and when starting up services. It would additionally make UDP RPC ping checking fail very quickly. > This means that if we try to stop nfsd while portmap is not running, > it takes 3 seconds instead of 3.5 minutes. > > Note that stopping portmap before the last nfsd thread has died is > not simply a configuration error. When we kill nfsd threads they > could take a while to die, and it is possible that portmap could > then be killed before the last nfsd thread has had a change to finish > up. > > [An alternate might be to make the sunrpc code always "connect" > udp sockets so that "port not reachable" errors would get reported > back.
This requires a more intrusive change though and might have > other consequences] We had discussed this about a year ago when I started adding IPv6 support. I had suggested switching the local rpc client to use TCP instead of UDP to solve exactly this time-out problem during start-up. There was some resistance to the idea because TCP would leave privileged ports in TIMEWAIT (at shutdown, this is probably not a significant concern). Trond had intended to introduce connected UDP socket support to the RPC client, although we were also interested in someday having a single UDP socket for all RPC traffic... the design never moved on from there. My feeling at this point is that having a connected UDP socket transport would be simpler and have broader benefits than waiting for an eventual design that can accommodate multiple transport instances sharing a single socket. > Signed-off-by: NeilBrown <neilb@suse.de> > --- > > net/sunrpc/rpcb_clnt.c | 17 +++++++++++++++-- > 1 files changed, 15 insertions(+), 2 deletions(-) > > diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c > index beee6da..31f7e2e 100644 > --- a/net/sunrpc/rpcb_clnt.c > +++ b/net/sunrpc/rpcb_clnt.c > @@ -132,7 +132,8 @@ static const struct sockaddr_in > rpcb_inaddr_loopback = { > }; > > static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr, > - size_t addrlen, u32 version) > + size_t addrlen, u32 version, > + int unregister) > { It might be cleaner to define the timeout structures as global statics, and pass in a pointer to the preferred structure instead of checking a passed-in integer. One conditional branch instead of two.
> struct rpc_create_args args = { > .protocol = XPRT_TRANSPORT_UDP, > @@ -144,6 +145,16 @@ static struct rpc_clnt > *rpcb_create_local(struct sockaddr *addr, > .authflavor = RPC_AUTH_UNIX, > .flags = RPC_CLNT_CREATE_NOPING, > }; > + const struct rpc_timeout unreg_timeout = { > + .to_initval = HZ/2, > + .to_maxval = HZ/2, > + .to_increment = 0, > + .to_retries = 0, > + .to_exponential = 0, > + }; > + > + if (unregister) > + args.timeout = &unreg_timeout; > > return rpc_create(&args); > } > @@ -183,10 +194,12 @@ static int rpcb_register_call(const u32 > version, struct rpc_message *msg) > size_t addrlen = sizeof(rpcb_inaddr_loopback); > struct rpc_clnt *rpcb_clnt; > int result, error = 0; > + int unregister; > > msg->rpc_resp = &result; > > - rpcb_clnt = rpcb_create_local(addr, addrlen, version); > + unregister = (msg->rpc_proc->p_proc == RPCBPROC_UNSET); > + rpcb_clnt = rpcb_create_local(addr, addrlen, version, unregister); > if (!IS_ERR(rpcb_clnt)) { > error = rpc_call_sync(rpcb_clnt, msg, 0); > rpc_shutdown_client(rpcb_clnt); -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-05-28 13:43 ` Chuck Lever @ 2009-06-11 4:48 ` Neil Brown [not found] ` <18992.35996.986951.556723-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 0 siblings, 1 reply; 25+ messages in thread From: Neil Brown @ 2009-06-11 4:48 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-nfs On Thursday May 28, chuck.lever@oracle.com wrote: > On May 28, 2009, at 2:33 AM, NeilBrown wrote: > > > > [An alternate might be to make the sunrpc code always "connect" > > udp sockets so that "port not reachable" errors would get reported > > back. This requires a more intrusive change though and might have > > other consequences] > > We had discussed this about a year ago when I started adding IPv6 > support. I had suggested switching the local rpc client to use TCP > instead of UDP to solve exactly this time-out problem during start- > up. There was some resistance to the idea because TCP would leave > privileged ports in TIMEWAIT (at shutdown, this is probably not a > significant concern). > > Trond had intended to introduce connected UDP socket support to the > RPC client, although we were also interested in someday having a > single UDP socket for all RPC traffic... the design never moved on > from there. > > My feeling at this point is that having a connected UDP socket > transport would be simpler and have broader benefits than waiting for > an eventual design that can accommodate multiple transport instances > sharing a single socket. The use of connected UDP would have to be limited to known-safe cases such as contacting the local portmap. I believe there are still NFS servers out there that - if multihomed - can reply from a different address to the one the request was sent to. And we would need to check that rpcbind does the right thing. 
I recently discovered that rpcbind is buggy and will sometimes respond from the wrong interface - I suspect localhost addresses are safe, but we would need to check, or fix it (I fixed that bug in portmap (glibc actually) 6 years ago and now it appears again in rpcbind - groan!). How hard would it be to add (optional) connected UDP support? Would we just make the code more like the TCP version, or are there any gotchas that you know of that we would need to be careful of? Thanks, NeilBrown ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <18992.35996.986951.556723-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. [not found] ` <18992.35996.986951.556723-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2009-06-11 15:44 ` Chuck Lever 2009-07-02 20:04 ` Chuck Lever 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2009-06-11 15:44 UTC (permalink / raw) To: Neil Brown; +Cc: linux-nfs On Jun 11, 2009, at 12:48 AM, Neil Brown wrote: > On Thursday May 28, chuck.lever@oracle.com wrote: >> On May 28, 2009, at 2:33 AM, NeilBrown wrote: >>> >>> [An alternate might be to make the sunrpc code always "connect" >>> udp sockets so that "port not reachable" errors would get reported >>> back. This requires a more intrusive change though and might have >>> other consequences] >> >> We had discussed this about a year ago when I started adding IPv6 >> support. I had suggested switching the local rpc client to use TCP >> instead of UDP to solve exactly this time-out problem during start- >> up. There was some resistance to the idea because TCP would leave >> privileged ports in TIMEWAIT (at shutdown, this is probably not a >> significant concern). >> >> Trond had intended to introduce connected UDP socket support to the >> RPC client, although we were also interested in someday having a >> single UDP socket for all RPC traffic... the design never moved on >> from there. >> >> My feeling at this point is that having a connected UDP socket >> transport would be simpler and have broader benefits than waiting for >> an eventual design that can accommodate multiple transport instances >> sharing a single socket. > > The use of connected UDP would have to be limited to known-safe cases > such as contacting the local portmap. I believe there are still NFS > servers out there that - if multihomed - can reply from a different > address to the one the request was sent to. I think I advocated for adding an entirely new transport capability called CUDP at the time. 
But this is definitely something to remember as we test. If a new transport capability is added, at this point we would likely need some additional logic in the NFS mount parsing logic to expose such a transport to user space. So, leaving that parsing logic alone should insulate the NFS client from the new transport until we have more confidence. > And we would need to check that rpcbind does the right thing. I > recently discovered that rpcbind is buggy and will sometimes respond > from the wrong interface - I suspect localhost addresses are safe, but > we would need to check, or fix it (I fixed that bug in portmap (glibc > actually) 6 years ago and now it appears again in rpcbind - groan!). Details welcome. We will probably need to fix libtirpc. > How hard would it be to add (optional) connected UDP support? Would > we just make the code more like the TCP version, or are there any > gotchas that you know of that we would need to be careful of? The code in net/sunrpc/xprtsock.c is a bunch of transport methods, many of which are shared between the UDP and TCP transport capabilities. You could probably do this easily by creating a new xprt_class structure and a new ops vector, then reuse as many UDP methods as possible. The TCP connect method could be usable as is, but it would be simple to copy-n-paste a new one if some variation is required. Then, define a new XPRT_ value, and use that in rpcb_create_local(). -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
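[Editorial note: the behavioral difference discussed above can be shown with plain userspace sockets. The sketch below is illustrative only, not the sunrpc kernel code: connect() on a UDP socket is what lets the kernel route the ICMP "port unreachable" error back to the caller as ECONNREFUSED, whereas an unconnected socket simply waits out its timeout. It assumes nothing is listening on UDP port 9 of the loopback interface.]

```c
/* Userspace illustration only -- not the sunrpc kernel code. Shows why a
 * *connected* UDP socket learns that the peer port is dead: connect()
 * attaches the ICMP error path to the socket, so a later send() fails
 * with ECONNREFUSED instead of silently timing out. */
#include <arpa/inet.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in dst;
	int fd;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);	/* assumption: no local UDP listener here */
	inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
		return 1;

	send(fd, "ping", 4, 0);		/* elicits ICMP port unreachable */
	usleep(100 * 1000);		/* let the error reach the socket */

	/* On a connected socket the queued ICMP error surfaces here. */
	if (send(fd, "ping", 4, 0) < 0 && errno == ECONNREFUSED)
		printf("connected UDP: ECONNREFUSED\n");
	else
		printf("no error reported\n");
	close(fd);
	return 0;
}
```

With an unconnected socket, the same two sendto() calls would both succeed and the caller would only discover the dead port when its RPC retransmit timer expired, which is exactly the 3.5-minute shutdown stall this series addresses.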
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-06-11 15:44 ` Chuck Lever @ 2009-07-02 20:04 ` Chuck Lever 2009-07-06 12:42 ` Suresh Jayaraman 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2009-07-02 20:04 UTC (permalink / raw) To: Neil Brown; +Cc: Linux NFS mailing list On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote: > On Jun 11, 2009, at 12:48 AM, Neil Brown wrote: >> On Thursday May 28, chuck.lever@oracle.com wrote: >>> On May 28, 2009, at 2:33 AM, NeilBrown wrote: >>>> >>>> [An alternate might be to make the sunrpc code always "connect" >>>> udp sockets so that "port not reachable" errors would get reported >>>> back. This requires a more intrusive change though and might have >>>> other consequences] >>> >>> We had discussed this about a year ago when I started adding IPv6 >>> support. I had suggested switching the local rpc client to use TCP >>> instead of UDP to solve exactly this time-out problem during start-up. There was some resistance to the idea because TCP would leave >>> privileged ports in TIMEWAIT (at shutdown, this is probably not a >>> significant concern). >>> >>> Trond had intended to introduce connected UDP socket support to the >>> RPC client, although we were also interested in someday having a >>> single UDP socket for all RPC traffic... the design never moved on >>> from there. >>> >>> My feeling at this point is that having a connected UDP socket >>> transport would be simpler and have broader benefits than waiting >>> for >>> an eventual design that can accommodate multiple transport instances >>> sharing a single socket. >> >> The use of connected UDP would have to be limited to known-safe cases >> such as contacting the local portmap. I believe there are still NFS >> servers out there that - if multihomed - can reply from a different >> address to the one the request was sent to. 
> > I think I advocated for adding an entirely new transport capability > called CUDP at the time. But this is definitely something to > remember as we test. > > If a new transport capability is added, at this point we would > likely need some additional logic in the NFS mount parsing logic to > expose such a transport to user space. So, leaving that parsing > logic alone should insulate the NFS client from the new transport > until we have more confidence. > >> And we would need to check that rpcbind does the right thing. I >> recently discovered that rpcbind is buggy and will sometimes respond >> from the wrong interface - I suspect localhost addresses are safe, >> but >> we would need to check, or fix it (I fixed that bug in portmap (glibc >> actually) 6 years ago and now it appears again in rpcbind - groan!). > > Details welcome. We will probably need to fix libtirpc. > >> How hard would it be to add (optional) connected UDP support? Would >> we just make the code more like the TCP version, or are there any >> gotchas that you know of that we would need to be careful of? > > The code in net/sunrpc/xprtsock.c is a bunch of transport methods, > many of which are shared between the UDP and TCP transport > capabilities. You could probably do this easily by creating a new > xprt_class structure and a new ops vector, then reuse as many UDP > methods as possible. The TCP connect method could be usable as is, > but it would be simple to copy-n-paste a new one if some variation > is required. Then, define a new XPRT_ value, and use that in > rpcb_create_local(). I've thought about this some more... It seems to me that you might be better off using the existing UDP transport code, but adding a new RPC_CLNT_CREATE_ flag to enable connected UDP semantics. The two transports are otherwise exactly the same. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-07-02 20:04 ` Chuck Lever @ 2009-07-06 12:42 ` Suresh Jayaraman 2009-07-06 14:30 ` Chuck Lever 0 siblings, 1 reply; 25+ messages in thread From: Suresh Jayaraman @ 2009-07-06 12:42 UTC (permalink / raw) To: Chuck Lever; +Cc: Neil Brown, Linux NFS mailing list Chuck Lever wrote: > On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote: >> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote: >> >>> How hard would it be to add (optional) connected UDP support? Would >>> we just make the code more like the TCP version, or are there any >>> gotchas that you know of that we would need to be careful of? >> >> The code in net/sunrpc/xprtsock.c is a bunch of transport methods, >> many of which are shared between the UDP and TCP transport >> capabilities. You could probably do this easily by creating a new >> xprt_class structure and a new ops vector, then reuse as many UDP >> methods as possible. The TCP connect method could be usable as is, >> but it would be simple to copy-n-paste a new one if some variation is >> required. Then, define a new XPRT_ value, and use that in >> rpcb_create_local(). I attempted a patch based on your suggestions; the socket seems to be getting the -ECONNREFUSED error, but it isn't propagating all the way up (I have yet to debug why). include/linux/sunrpc/xprtsock.h | 6 ++ net/sunrpc/rpcb_clnt.c | 2 +- net/sunrpc/xprt.c | 5 + net/sunrpc/xprtsock.c | 186 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 198 insertions(+), 1 deletions(-) diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h index c2a46c4..3f05a45 100644 --- a/include/linux/sunrpc/xprtsock.h +++ b/include/linux/sunrpc/xprtsock.h @@ -22,6 +22,12 @@ void cleanup_socket_xprt(void); */ #define XPRT_TRANSPORT_UDP IPPROTO_UDP #define XPRT_TRANSPORT_TCP IPPROTO_TCP +/* + * Connected UDP doesn't seem to be a well-defined protocol, picking a number + * other than 17 and 6 should be OK. 
+ * FIXME: If the above assumption is wrong. + */ +#define XPRT_TRANSPORT_CUDP 18 /* * RPC slot table sizes for UDP, TCP transports diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c index beee6da..73f8048 100644 --- a/net/sunrpc/rpcb_clnt.c +++ b/net/sunrpc/rpcb_clnt.c @@ -135,7 +135,7 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr, size_t addrlen, u32 version) { struct rpc_create_args args = { - .protocol = XPRT_TRANSPORT_UDP, + .protocol = XPRT_TRANSPORT_CUDP, .address = addr, .addrsize = addrlen, .servername = "localhost", diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index f412a85..7e605f1 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -898,6 +898,11 @@ void xprt_transmit(struct rpc_task *task) req->rq_connect_cookie = xprt->connect_cookie; req->rq_xtime = jiffies; status = xprt->ops->send_request(task); + if (status == -ECONNREFUSED) { + dprintk("RPC: send_request returned (%d) in xprt_transmit\n", + status); + rpc_exit(task, status); + } if (status != 0) { task->tk_status = status; return; diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 83c73c4..6cc24c3 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -193,6 +193,8 @@ static ctl_table sunrpc_table[] = { */ #define XS_IDLE_DISC_TO (5U * 60 * HZ) +#define IPPROTO_CUDP XPRT_TRANSPORT_CUDP + #ifdef RPC_DEBUG # undef RPC_DEBUG_DATA # define RPCDBG_FACILITY RPCDBG_TRANS @@ -1832,6 +1834,100 @@ out: xprt_wake_pending_tasks(xprt, status); } +/** + * xs_cudp_connect_worker4 - set up a connected UDP socket + * @work: RPC transport to connect + * + * Invoked by a work queue tasklet. 
+ */ +static void xs_cudp_connect_worker4(struct work_struct *work) +{ + struct sock_xprt *transport = + container_of(work, struct sock_xprt, connect_worker.work); + struct rpc_xprt *xprt = &transport->xprt; + struct socket *sock = transport->sock; + int err, status = -EIO; + + if (xprt->shutdown) + goto out; + + /* Start by resetting any existing state */ + xs_reset_transport(transport); + + err = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock); + if (err < 0) { + dprintk("RPC: can't create UDP transport socket (%d).\n", -err); + goto out; + } + xs_reclassify_socket4(sock); + + if (xs_bind4(transport, sock)) { + sock_release(sock); + goto out; + } + + dprintk("RPC: worker connecting xprt %p to address: %s\n", + xprt, xprt->address_strings[RPC_DISPLAY_ALL]); + + xs_udp_finish_connecting(xprt, sock); + err = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK); + if (err < 0) { + dprintk("RPC: connect on UDP socket.. failed (%d).\n", -err); + goto out; + } + status = 0; +out: + xprt_clear_connecting(xprt); + xprt_wake_pending_tasks(xprt, status); +} + +/** + * xs_cudp_connect_worker6 - set up a Connected UDP socket + * @work: RPC transport to connect + * + * Invoked by a work queue tasklet. 
+ */ +static void xs_cudp_connect_worker6(struct work_struct *work) +{ + struct sock_xprt *transport = + container_of(work, struct sock_xprt, connect_worker.work); + struct rpc_xprt *xprt = &transport->xprt; + struct socket *sock = transport->sock; + int err, status = -EIO; + + if (xprt->shutdown) + goto out; + + /* Start by resetting any existing state */ + xs_reset_transport(transport); + + err = sock_create_kern(PF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock); + if (err < 0) { + dprintk("RPC: can't create UDP transport socket (%d).\n", -err); + goto out; + } + xs_reclassify_socket6(sock); + + if (xs_bind6(transport, sock) < 0) { + sock_release(sock); + goto out; + } + + dprintk("RPC: worker connecting xprt %p to address: %s\n", + xprt, xprt->address_strings[RPC_DISPLAY_ALL]); + + xs_udp_finish_connecting(xprt, sock); + err = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK); + if (err < 0) { + dprintk("RPC: connect on UDP socket... failed (%d).\n", -err); + goto out; + } + status = 0; +out: + xprt_clear_connecting(xprt); + xprt_wake_pending_tasks(xprt, status); +} + /* * We need to preserve the port number so the reply cache on the server can * find our cached RPC replies when we get around to reconnecting. 
@@ -2192,6 +2288,24 @@ static struct rpc_xprt_ops xs_tcp_ops = { .print_stats = xs_tcp_print_stats, }; +static struct rpc_xprt_ops xs_cudp_ops = { + .set_buffer_size = xs_udp_set_buffer_size, + .reserve_xprt = xprt_reserve_xprt_cong, + .release_xprt = xprt_release_xprt_cong, + .rpcbind = rpcb_getport_async, + .set_port = xs_set_port, + .connect = xs_tcp_connect, + .buf_alloc = rpc_malloc, + .buf_free = rpc_free, + .send_request = xs_udp_send_request, + .set_retrans_timeout = xprt_set_retrans_timeout_rtt, + .timer = xs_udp_timer, + .release_request = xprt_release_rqst_cong, + .close = xs_close, + .destroy = xs_destroy, + .print_stats = xs_udp_print_stats, +}; + static struct rpc_xprt *xs_setup_xprt(struct xprt_create *args, unsigned int slot_table_size) { @@ -2298,6 +2412,68 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args) return ERR_PTR(-EINVAL); } +/** + * xs_setup_cudp - Set up transport to use a connected UDP socket + * @args: rpc transport creation arguments + * + */ +static struct rpc_xprt *xs_setup_cudp(struct xprt_create *args) +{ + struct sockaddr *addr = args->dstaddr; + struct rpc_xprt *xprt; + struct sock_xprt *transport; + + xprt = xs_setup_xprt(args, xprt_udp_slot_table_entries); + if (IS_ERR(xprt)) + return xprt; + transport = container_of(xprt, struct sock_xprt, xprt); + + xprt->prot = IPPROTO_CUDP; + xprt->tsh_size = 0; + /* XXX: header size can vary due to auth type, IPv6, etc. 
*/ + xprt->max_payload = (1U << 16) - (MAX_HEADER << 3); + + xprt->bind_timeout = XS_BIND_TO; + xprt->connect_timeout = XS_UDP_CONN_TO; + xprt->reestablish_timeout = XS_UDP_REEST_TO; + xprt->idle_timeout = XS_IDLE_DISC_TO; + + xprt->ops = &xs_cudp_ops; + xprt->timeout = &xs_udp_default_timeout; + + switch (addr->sa_family) { + case AF_INET: + if (((struct sockaddr_in *)addr)->sin_port != htons(0)) + xprt_set_bound(xprt); + + INIT_DELAYED_WORK(&transport->connect_worker, + xs_cudp_connect_worker4); + xs_format_ipv4_peer_addresses(xprt, "cudp", RPCBIND_NETID_UDP); + break; + case AF_INET6: + if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0)) + xprt_set_bound(xprt); + + INIT_DELAYED_WORK(&transport->connect_worker, + xs_cudp_connect_worker6); + xs_format_ipv6_peer_addresses(xprt, "cudp", RPCBIND_NETID_UDP); + break; + default: + kfree(xprt); + return ERR_PTR(-EAFNOSUPPORT); + } + + dprintk("RPC: set up transport to address %s\n", + xprt->address_strings[RPC_DISPLAY_ALL]); + + if (try_module_get(THIS_MODULE)) + return xprt; + + kfree(xprt->slot); + kfree(xprt); + return ERR_PTR(-EINVAL); +} + static const struct rpc_timeout xs_tcp_default_timeout = { .to_initval = 60 * HZ, .to_maxval = 60 * HZ, @@ -2379,6 +2555,14 @@ static struct xprt_class xs_tcp_transport = { .setup = xs_setup_tcp, }; +static struct xprt_class xs_cudp_transport = { + .list = LIST_HEAD_INIT(xs_cudp_transport.list), + .name = "cudp", + .owner = THIS_MODULE, + .ident = IPPROTO_CUDP, + .setup = xs_setup_cudp, +}; + /** * init_socket_xprt - set up xprtsock's sysctls, register with RPC client * @@ -2392,6 +2576,7 @@ int init_socket_xprt(void) xprt_register_transport(&xs_udp_transport); xprt_register_transport(&xs_tcp_transport); + xprt_register_transport(&xs_cudp_transport); return 0; } @@ -2411,4 +2596,5 @@ void cleanup_socket_xprt(void) xprt_unregister_transport(&xs_udp_transport); xprt_unregister_transport(&xs_tcp_transport); + xprt_unregister_transport(&xs_cudp_transport); } > I've thought 
about this some more... > > It seems to me that you might be better off using the existing UDP > transport code, but adding a new RPC_CLNT_CREATE_ flag to enable > connected UDP semantics. The two transports are otherwise exactly the > same. > It doesn't seem that there is a clean way of doing this. The function xs_setup_udp() sets up the corresponding connect_worker function which actually sets up the UDP socket. There doesn't seem to be a way to check whether this flag (or a new rpc_clnt->cl_ flag) is set or not in either of the functions. OTOH, why not use AF_LOCAL sockets since it's for local communication only? Thanks, -- Suresh Jayaraman ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-07-06 12:42 ` Suresh Jayaraman @ 2009-07-06 14:30 ` Chuck Lever 2009-07-06 16:08 ` Suresh Jayaraman 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2009-07-06 14:30 UTC (permalink / raw) To: Suresh Jayaraman; +Cc: Neil Brown, Trond Myklebust, Linux NFS mailing list On Jul 6, 2009, at 8:42 AM, Suresh Jayaraman wrote: > Chuck Lever wrote: >> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote: >>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote: > >>> >>>> How hard would it be to add (optional) connected UDP support? >>>> Would >>>> we just make the code more like the TCP version, or are there any >>>> gotchas that you know of that we would need to be careful of? >>> >>> The code in net/sunrpc/xprtsock.c is a bunch of transport methods, >>> many of which are shared between the UDP and TCP transport >>> capabilities. You could probably do this easily by creating a new >>> xprt_class structure and a new ops vector, then reuse as many UDP >>> methods as possible. The TCP connect method could be usable as is, >>> but it would be simple to copy-n-paste a new one if some variation >>> is >>> required. Then, define a new XPRT_ value, and use that in >>> rpcb_create_local(). > > I attempted a patch based on your suggestions, while the socket seems > to be getting the -ECONNREFUSED error, but it isn't propagating all > the > way up (yet to debug, why) I suspect it's because a while ago Trond changed the connect logic to retry everything, including ECONNREFUSED. I've hit this problem recently as well. The kernel's NFS mount client needs rpcbind to recognize when a port is not active so it can stop retrying that port and switch to a different transport, just as mount.nfs does. We will need to add a mechanism to allow ECONNREFUSED to be propagated up the stack again. Trond suggested adding a per-RPC flag (on rpc_call_sync()) to do this. 
Connection semantics seem to me to be an attribute of the transport, not of a single RPC, though. And, the protocol where we really need this is rpcbind, which usually creates a transport to send just a single RPC. > include/linux/sunrpc/xprtsock.h | 6 ++ > net/sunrpc/rpcb_clnt.c | 2 +- > net/sunrpc/xprt.c | 5 + > net/sunrpc/xprtsock.c | 186 ++++++++++++++++++++++++++++++ > +++++++++ > 4 files changed, 198 insertions(+), 1 deletions(-) > > diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/ > xprtsock.h > index c2a46c4..3f05a45 100644 > --- a/include/linux/sunrpc/xprtsock.h > +++ b/include/linux/sunrpc/xprtsock.h > @@ -22,6 +22,12 @@ void cleanup_socket_xprt(void); > */ > #define XPRT_TRANSPORT_UDP IPPROTO_UDP > #define XPRT_TRANSPORT_TCP IPPROTO_TCP > +/* > + * Connected UDP doesn't seem to be a well-defined protocol, > picking a number > + * other than 17 and 6 should be OK. > + * FIXME: If the above assumption is wrong. > + */ > +#define XPRT_TRANSPORT_CUDP 18 > > /* > * RPC slot table sizes for UDP, TCP transports > diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c > index beee6da..73f8048 100644 > --- a/net/sunrpc/rpcb_clnt.c > +++ b/net/sunrpc/rpcb_clnt.c > @@ -135,7 +135,7 @@ static struct rpc_clnt *rpcb_create_local(struct > sockaddr *addr, > size_t addrlen, u32 version) > { > struct rpc_create_args args = { > - .protocol = XPRT_TRANSPORT_UDP, > + .protocol = XPRT_TRANSPORT_CUDP, > .address = addr, > .addrsize = addrlen, > .servername = "localhost", > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c > index f412a85..7e605f1 100644 > --- a/net/sunrpc/xprt.c > +++ b/net/sunrpc/xprt.c > @@ -898,6 +898,11 @@ void xprt_transmit(struct rpc_task *task) > req->rq_connect_cookie = xprt->connect_cookie; > req->rq_xtime = jiffies; > status = xprt->ops->send_request(task); > + if (status == -ECONNREFUSED) { > + dprintk("RPC: send_request returned (%d) in xprt_transmit\n", > + status); > + rpc_exit(task, status); > + } > if (status != 0) { 
> task->tk_status = status; > return; > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > index 83c73c4..6cc24c3 100644 > --- a/net/sunrpc/xprtsock.c > +++ b/net/sunrpc/xprtsock.c > @@ -193,6 +193,8 @@ static ctl_table sunrpc_table[] = { > */ > #define XS_IDLE_DISC_TO (5U * 60 * HZ) > > +#define IPPROTO_CUDP XPRT_TRANSPORT_CUDP > + > #ifdef RPC_DEBUG > # undef RPC_DEBUG_DATA > # define RPCDBG_FACILITY RPCDBG_TRANS > @@ -1832,6 +1834,100 @@ out: > xprt_wake_pending_tasks(xprt, status); > } > > +/** > + * xs_cudp_connect_worker4 - set up a connected UDP socket > + * @work: RPC transport to connect > + * > + * Invoked by a work queue tasklet. > + */ > +static void xs_cudp_connect_worker4(struct work_struct *work) > +{ > + struct sock_xprt *transport = > + container_of(work, struct sock_xprt, connect_worker.work); > + struct rpc_xprt *xprt = &transport->xprt; > + struct socket *sock = transport->sock; > + int err, status = -EIO; > + > + if (xprt->shutdown) > + goto out; > + > + /* Start by resetting any existing state */ > + xs_reset_transport(transport); > + > + err = sock_create_kern(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock); > + if (err < 0) { > + dprintk("RPC: can't create UDP transport socket (%d).\n", - > err); > + goto out; > + } > + xs_reclassify_socket4(sock); > + > + if (xs_bind4(transport, sock)) { > + sock_release(sock); > + goto out; > + } > + > + dprintk("RPC: worker connecting xprt %p to address: %s\n", > + xprt, xprt->address_strings[RPC_DISPLAY_ALL]); > + > + xs_udp_finish_connecting(xprt, sock); > + err = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, > O_NONBLOCK); > + if (err < 0) { > + dprintk("RPC: connect on UDP socket.. failed (%d).\n", -err); > + goto out; > + } > + status = 0; > +out: > + xprt_clear_connecting(xprt); > + xprt_wake_pending_tasks(xprt, status); > +} > + > +/** > + * xs_cudp_connect_worker6 - set up a Connected UDP socket > + * @work: RPC transport to connect > + * > + * Invoked by a work queue tasklet. 
> + */ > +static void xs_cudp_connect_worker6(struct work_struct *work) > +{ > + struct sock_xprt *transport = > + container_of(work, struct sock_xprt, connect_worker.work); > + struct rpc_xprt *xprt = &transport->xprt; > + struct socket *sock = transport->sock; > + int err, status = -EIO; > + > + if (xprt->shutdown) > + goto out; > + > + /* Start by resetting any existing state */ > + xs_reset_transport(transport); > + > + err = sock_create_kern(PF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock); > + if (err < 0) { > + dprintk("RPC: can't create UDP transport socket (%d).\n", - > err); > + goto out; > + } > + xs_reclassify_socket6(sock); > + > + if (xs_bind6(transport, sock) < 0) { > + sock_release(sock); > + goto out; > + } > + > + dprintk("RPC: worker connecting xprt %p to address: %s\n", > + xprt, xprt->address_strings[RPC_DISPLAY_ALL]); > + > + xs_udp_finish_connecting(xprt, sock); > + err = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, > O_NONBLOCK); > + if (err < 0) { > + dprintk("RPC: connect on UDP socket... failed (%d).\n", -err); > + goto out; > + } > + status = 0; > +out: > + xprt_clear_connecting(xprt); > + xprt_wake_pending_tasks(xprt, status); > +} > + > /* > * We need to preserve the port number so the reply cache on the > server can > * find our cached RPC replies when we get around to reconnecting. 
> @@ -2192,6 +2288,24 @@ static struct rpc_xprt_ops xs_tcp_ops = { > .print_stats = xs_tcp_print_stats, > }; > > +static struct rpc_xprt_ops xs_cudp_ops = { > + .set_buffer_size = xs_udp_set_buffer_size, > + .reserve_xprt = xprt_reserve_xprt_cong, > + .release_xprt = xprt_release_xprt_cong, > + .rpcbind = rpcb_getport_async, > + .set_port = xs_set_port, > + .connect = xs_tcp_connect, > + .buf_alloc = rpc_malloc, > + .buf_free = rpc_free, > + .send_request = xs_udp_send_request, > + .set_retrans_timeout = xprt_set_retrans_timeout_rtt, > + .timer = xs_udp_timer, > + .release_request = xprt_release_rqst_cong, > + .close = xs_close, > + .destroy = xs_destroy, > + .print_stats = xs_udp_print_stats, > +}; > + > static struct rpc_xprt *xs_setup_xprt(struct xprt_create *args, > unsigned int slot_table_size) > { > @@ -2298,6 +2412,68 @@ static struct rpc_xprt *xs_setup_udp(struct > xprt_create *args) > return ERR_PTR(-EINVAL); > } > > +/** > + * xs_setup_cudp - Set up transport to use a connected UDP socket > + * @args: rpc transport creation arguments > + * > + */ > +static struct rpc_xprt *xs_setup_cudp(struct xprt_create *args) > +{ > + struct sockaddr *addr = args->dstaddr; > + struct rpc_xprt *xprt; > + struct sock_xprt *transport; > + > + xprt = xs_setup_xprt(args, xprt_udp_slot_table_entries); > + if (IS_ERR(xprt)) > + return xprt; > + transport = container_of(xprt, struct sock_xprt, xprt); > + > + xprt->prot = IPPROTO_CUDP; > + xprt->tsh_size = 0; > + /* XXX: header size can vary due to auth type, IPv6, etc. 
*/ > + xprt->max_payload = (1U << 16) - (MAX_HEADER << 3); > + > + xprt->bind_timeout = XS_BIND_TO; > + xprt->connect_timeout = XS_UDP_CONN_TO; > + xprt->reestablish_timeout = XS_UDP_REEST_TO; > + xprt->idle_timeout = XS_IDLE_DISC_TO; > + > + xprt->ops = &xs_cudp_ops; > + xprt->timeout = &xs_udp_default_timeout; > + > + switch (addr->sa_family) { > + case AF_INET: > + if (((struct sockaddr_in *)addr)->sin_port != htons(0)) > + xprt_set_bound(xprt); > + > + INIT_DELAYED_WORK(&transport->connect_worker, > + xs_cudp_connect_worker4); > + xs_format_ipv4_peer_addresses(xprt, "cudp", RPCBIND_NETID_UDP); > + break; > + case AF_INET6: > + if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0)) > + xprt_set_bound(xprt); > + > + INIT_DELAYED_WORK(&transport->connect_worker, > + xs_cudp_connect_worker6); > + xs_format_ipv6_peer_addresses(xprt, "cudp", RPCBIND_NETID_UDP); > + break; > + default: > + kfree(xprt); > + return ERR_PTR(-EAFNOSUPPORT); > + } > + > + dprintk("RPC: set up transport to address %s\n", > + xprt->address_strings[RPC_DISPLAY_ALL]); > + > + if (try_module_get(THIS_MODULE)) > + return xprt; > + > + kfree(xprt->slot); > + kfree(xprt); > + return ERR_PTR(-EINVAL); > +} > + > static const struct rpc_timeout xs_tcp_default_timeout = { > .to_initval = 60 * HZ, > .to_maxval = 60 * HZ, > @@ -2379,6 +2555,14 @@ static struct xprt_class xs_tcp_transport = { > .setup = xs_setup_tcp, > }; > > +static struct xprt_class xs_cudp_transport = { > + .list = LIST_HEAD_INIT(xs_cudp_transport.list), > + .name = "cudp", > + .owner = THIS_MODULE, > + .ident = IPPROTO_CUDP, > + .setup = xs_setup_cudp, > +}; > + > /** > * init_socket_xprt - set up xprtsock's sysctls, register with RPC > client > * > @@ -2392,6 +2576,7 @@ int init_socket_xprt(void) > > xprt_register_transport(&xs_udp_transport); > xprt_register_transport(&xs_tcp_transport); > + xprt_register_transport(&xs_cudp_transport); > > return 0; > } > @@ -2411,4 +2596,5 @@ void cleanup_socket_xprt(void) > > 
xprt_unregister_transport(&xs_udp_transport); > xprt_unregister_transport(&xs_tcp_transport); > + xprt_unregister_transport(&xs_cudp_transport); > } > >> I've thought about this some more... >> >> It seems to me that you might be better off using the existing UDP >> transport code, but adding a new RPC_CLNT_CREATE_ flag to enable >> connected UDP semantics. The two transports are otherwise exactly >> the >> same. >> > > It doesn't seem that there is a clean way of doing this. The function > xs_setup_udp() sets up the corresponding connect_worker function which > actually sets up the UDP socket. There doesn't seem to be a way to > check > whether this flag (or a new rpc_clnt->cl_ flag) is set or not in > either of > the functions. > > OTOH, why not use AF_LOCAL sockets since it's for local > communication only? > > > Thanks, > > > -- > Suresh Jayaraman -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-07-06 14:30 ` Chuck Lever @ 2009-07-06 16:08 ` Suresh Jayaraman 2009-07-06 16:22 ` Trond Myklebust 2009-07-06 16:31 ` Chuck Lever 0 siblings, 2 replies; 25+ messages in thread From: Suresh Jayaraman @ 2009-07-06 16:08 UTC (permalink / raw) To: Chuck Lever; +Cc: Neil Brown, Trond Myklebust, Linux NFS mailing list Chuck Lever wrote: > On Jul 6, 2009, at 8:42 AM, Suresh Jayaraman wrote: >> Chuck Lever wrote: >>> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote: >>>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote: >> >>>> >>>>> How hard would it be to add (optional) connected UDP support? Would >>>>> we just make the code more like the TCP version, or are there any >>>>> gotchas that you know of that we would need to be careful of? >>>> >>>> The code in net/sunrpc/xprtsock.c is a bunch of transport methods, >>>> many of which are shared between the UDP and TCP transport >>>> capabilities. You could probably do this easily by creating a new >>>> xprt_class structure and a new ops vector, then reuse as many UDP >>>> methods as possible. The TCP connect method could be usable as is, >>>> but it would be simple to copy-n-paste a new one if some variation is >>>> required. Then, define a new XPRT_ value, and use that in >>>> rpcb_create_local(). >> >> I attempted a patch based on your suggestions, while the socket seems >> to be getting the -ECONNREFUSED error, but it isn't propagating all the >> way up (yet to debug, why) > > I suspect it's because a while ago Trond changed the connect logic to > retry everything, including ECONNREFUSED. > I've hit this problem recently as well. The kernel's NFS mount client > needs rpcbind to recognize when a port is not active so it can stop > retrying that port and switch to a different transport, just as > mount.nfs does. > > We will need to add a mechanism to allow ECONNREFUSED to be propagated > up the stack again. 
Trond suggested adding a per-RPC flag (on > rpc_call_sync()) to do this. Connection semantics seem to me to be an > attribute of the transport, not of a single RPC, though. And, the > protocol where we really need this is rpcbind, which usually creates a > transport to send just a single RPC. > Ah ok, good to know this. BTW, it seems my questions on using RPC_CLNT_CREATE_ flag and using AF_LOCAL sockets got overshadowed (seen below) by the patch. Would making rpcbind using AF_LOCAL sockets a good idea or connected UDP still seems a better solution? >> >>> I've thought about this some more... >>> >>> It seems to me that you might be better off using the existing UDP >>> transport code, but adding a new RPC_CLNT_CREATE_ flag to enable >>> connected UDP semantics. The two transports are otherwise exactly the >>> same. >>> >> >> It doesn't seem that there is a clean way of doing this. The function >> xs_setup_udp() sets up the corresponding connect_worker function which >> actually sets up the UDP socket. There doesn't seem to be a way to check >> whether this flag (or a new rpc_clnt->cl_ flag) is set or not in >> either of >> the functions. >> >> OTOH, why not use AF_LOCAL sockets since it's for local communication >> only? >> Thanks, -- Suresh Jayaraman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations. 2009-07-06 16:08 ` Suresh Jayaraman @ 2009-07-06 16:22 ` Trond Myklebust 2009-07-06 16:31 ` Chuck Lever 1 sibling, 0 replies; 25+ messages in thread From: Trond Myklebust @ 2009-07-06 16:22 UTC (permalink / raw) To: Suresh Jayaraman; +Cc: Chuck Lever, Neil Brown, Linux NFS mailing list On Mon, 2009-07-06 at 21:38 +0530, Suresh Jayaraman wrote: > BTW, it seems my questions on using RPC_CLNT_CREATE_ flag and using > AF_LOCAL sockets got overshadowed (seen below) by the patch. Would > making rpcbind using AF_LOCAL sockets a good idea or connected UDP still > seems a better solution? The kernel RPC client doesn't support AF_LOCAL sockets. Why would we want to add them when we already have other mechanisms to communicate with userland that provide the same functionality? Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  2009-07-06 16:08           ` Suresh Jayaraman
  2009-07-06 16:22             ` Trond Myklebust
@ 2009-07-06 16:31             ` Chuck Lever
  2009-07-06 16:40               ` Trond Myklebust
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2009-07-06 16:31 UTC (permalink / raw)
To: Suresh Jayaraman; +Cc: Neil Brown, Trond Myklebust, Linux NFS mailing list

On Jul 6, 2009, at 12:08 PM, Suresh Jayaraman wrote:
> Chuck Lever wrote:
>> On Jul 6, 2009, at 8:42 AM, Suresh Jayaraman wrote:
>>> Chuck Lever wrote:
>>>> On Jun 11, 2009, at 11:44 AM, Chuck Lever wrote:
>>>>> On Jun 11, 2009, at 12:48 AM, Neil Brown wrote:
>>>>>
>>>>>> How hard would it be to add (optional) connected UDP support?
>>>>>> Would we just make the code more like the TCP version, or are
>>>>>> there any gotchas that you know of that we would need to be
>>>>>> careful of?
>>>>>
>>>>> The code in net/sunrpc/xprtsock.c is a bunch of transport
>>>>> methods, many of which are shared between the UDP and TCP
>>>>> transport capabilities.  You could probably do this easily by
>>>>> creating a new xprt_class structure and a new ops vector, then
>>>>> reuse as many UDP methods as possible.  The TCP connect method
>>>>> could be usable as is, but it would be simple to copy-n-paste a
>>>>> new one if some variation is required.  Then, define a new XPRT_
>>>>> value, and use that in rpcb_create_local().
>>>
>>> I attempted a patch based on your suggestions.  While the socket
>>> seems to be getting the -ECONNREFUSED error, it isn't propagating
>>> all the way up (yet to debug why).
>>
>> I suspect it's because a while ago Trond changed the connect logic
>> to retry everything, including ECONNREFUSED.
>>
>> I've hit this problem recently as well.  The kernel's NFS mount
>> client needs rpcbind to recognize when a port is not active so it
>> can stop retrying that port and switch to a different transport,
>> just as mount.nfs does.
>>
>> We will need to add a mechanism to allow ECONNREFUSED to be
>> propagated up the stack again.  Trond suggested adding a per-RPC
>> flag (on rpc_call_sync()) to do this.  Connection semantics seem to
>> me to be an attribute of the transport, not of a single RPC,
>> though.  And the protocol where we really need this is rpcbind,
>> which usually creates a transport to send just a single RPC.
>
> Ah, OK, good to know this.
>
> BTW, it seems my questions on using an RPC_CLNT_CREATE_ flag and
> using AF_LOCAL sockets got overshadowed (see below) by the patch.
> Would making rpcbind use AF_LOCAL sockets be a good idea, or does
> connected UDP still seem a better solution?

See below.

>>>> I've thought about this some more...
>>>>
>>>> It seems to me that you might be better off using the existing
>>>> UDP transport code, but adding a new RPC_CLNT_CREATE_ flag to
>>>> enable connected UDP semantics.  The two transports are otherwise
>>>> exactly the same.
>>>
>>> It doesn't seem that there is a clean way of doing this.  The
>>> function xs_setup_udp() sets up the corresponding connect_worker
>>> function, which actually sets up the UDP socket.  There doesn't
>>> seem to be a way to check whether this flag (or a new
>>> rpc_clnt->cl_ flag) is set in either of these functions.

See xs_tcp_connect().  On connect, you get an rpc_task structure which
contains an rpc_clnt pointer, and can use that to switch connection
semantics.

>>> OTOH, why not use AF_LOCAL sockets, since it's for local
>>> communication only?

I have considered that.  AF_LOCAL in fact could replace all of our
upcall mechanisms.  However, portmapper, which doesn't support
AF_LOCAL, is still used in some distributions.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 25+ messages in thread
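[The connected-UDP behaviour discussed above can be demonstrated entirely in user space: on a connected UDP socket, the ICMP port-unreachable elicited by a closed port comes back as ECONNREFUSED on a later call on that socket, which is exactly the signal the thread says the kernel client currently discards. A minimal sketch; the port number is an arbitrary value assumed to be closed on loopback:]

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in sin;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(45111);	/* assumed-closed UDP port */
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* connect() on a datagram socket just records the peer;
	 * no packets are exchanged */
	if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		return 1;

	/* the first send elicits an ICMP port-unreachable */
	send(fd, "hi", 2, 0);
	usleep(100000);		/* let the error reach the socket */

	/* on a *connected* socket, that error is reported on the
	 * next system call; an unconnected socket would silently
	 * drop it and keep retrying */
	if (send(fd, "hi", 2, 0) < 0 && errno == ECONNREFUSED)
		printf("ECONNREFUSED\n");
	else
		printf("no error reported\n");
	close(fd);
	return 0;
}
```

[On an unconnected UDP socket the same two sends both appear to succeed, which is why an RPC client using one can only discover a dead portmapper by timing out.]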
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  2009-07-06 16:31             ` Chuck Lever
@ 2009-07-06 16:40               ` Trond Myklebust
  [not found]                     ` <1246898450.11267.12.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Trond Myklebust @ 2009-07-06 16:40 UTC (permalink / raw)
To: Chuck Lever; +Cc: Suresh Jayaraman, Neil Brown, Linux NFS mailing list

On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
> I have considered that.  AF_LOCAL in fact could replace all of our
> upcall mechanisms.  However, portmapper, which doesn't support
> AF_LOCAL, is still used in some distributions.

As could AF_NETLINK, fork(), pipes, fifos, etc...  Again: why would we
want to saddle ourselves with RPC over AF_LOCAL?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 25+ messages in thread
[parent not found: <1246898450.11267.12.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  [not found]                     ` <1246898450.11267.12.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-07-06 16:57                       ` Chuck Lever
  2009-07-06 17:14                         ` Trond Myklebust
  0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2009-07-06 16:57 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Suresh Jayaraman, Neil Brown, Linux NFS mailing list

On Jul 6, 2009, at 12:40 PM, Trond Myklebust wrote:
> On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
>> I have considered that.  AF_LOCAL in fact could replace all of our
>> upcall mechanisms.  However, portmapper, which doesn't support
>> AF_LOCAL, is still used in some distributions.
>
> As could AF_NETLINK, fork(), pipes, fifos, etc...  Again: why would we
> want to saddle ourselves with RPC over AF_LOCAL?

TI-RPC supports AF_LOCAL RPC transports.

[cel@matisse notify-one]$ rpcinfo
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100024    1    udp       0.0.0.0.206.127        status     29
    100024    1    tcp       0.0.0.0.166.105        status     29
    100024    1    udp6      ::.141.238             status     29
    100024    1    tcp6      ::.192.160             status     29
[cel@matisse notify-one]$

The listing for '/var/run/rpcbind.sock' is rpcbind's AF_LOCAL
listener.  TI-RPC's rpcb_foo() calls use this method of accessing the
rpcbind database rather than going over loopback.

rpcbind scrapes the caller's effective UID off the transport socket
and uses that for authentication.  Note the "owner" column... that
comes from the socket's UID, not from the r_owner field.  When a
service is registered over the network, the owner column says
"unknown" and basically anyone can unset it.

If the kernel used AF_LOCAL to register its services, it would mean we
would never use a network port for local rpcbind calls between the
kernel and rpcbind, and rpcbind could automatically prevent the
kernel's RPC services from getting unset by malicious users.  If
/var/run/rpcbind.sock isn't there, the kernel would know immediately
that rpcbind wasn't running.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  2009-07-06 16:57                       ` Chuck Lever
@ 2009-07-06 17:14                         ` Trond Myklebust
  [not found]                                 ` <1246900456.11267.34.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Trond Myklebust @ 2009-07-06 17:14 UTC (permalink / raw)
To: Chuck Lever; +Cc: Suresh Jayaraman, Neil Brown, Linux NFS mailing list

On Mon, 2009-07-06 at 12:57 -0400, Chuck Lever wrote:
> On Jul 6, 2009, at 12:40 PM, Trond Myklebust wrote:
>> On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
>>> I have considered that.  AF_LOCAL in fact could replace all of our
>>> upcall mechanisms.  However, portmapper, which doesn't support
>>> AF_LOCAL, is still used in some distributions.
>>
>> As could AF_NETLINK, fork(), pipes, fifos, etc...  Again: why would
>> we want to saddle ourselves with RPC over AF_LOCAL?
>
> TI-RPC supports AF_LOCAL RPC transports.
>
> [cel@matisse notify-one]$ rpcinfo
>    program version netid     address                service    owner
>     100000    4    tcp6      ::.0.111               portmapper superuser
>     100000    3    tcp6      ::.0.111               portmapper superuser
>     100000    4    udp6      ::.0.111               portmapper superuser
>     100000    3    udp6      ::.0.111               portmapper superuser
>     100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>     100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>     100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>     100000    4    udp       0.0.0.0.0.111          portmapper superuser
>     100000    3    udp       0.0.0.0.0.111          portmapper superuser
>     100000    2    udp       0.0.0.0.0.111          portmapper superuser
>     100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>     100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>     100024    1    udp       0.0.0.0.206.127        status     29
>     100024    1    tcp       0.0.0.0.166.105        status     29
>     100024    1    udp6      ::.141.238             status     29
>     100024    1    tcp6      ::.192.160             status     29
> [cel@matisse notify-one]$
>
> The listing for '/var/run/rpcbind.sock' is rpcbind's AF_LOCAL
> listener.  TI-RPC's rpcb_foo() calls use this method of accessing the
> rpcbind database rather than going over loopback.
>
> rpcbind scrapes the caller's effective UID off the transport socket
> and uses that for authentication.  Note the "owner" column... that
> comes from the socket's UID, not from the r_owner field.  When a
> service is registered over the network, the owner column says
> "unknown" and basically anyone can unset it.
>
> If the kernel used AF_LOCAL to register its services, it would mean
> we would never use a network port for local rpcbind calls between
> the kernel and rpcbind, and rpcbind could automatically prevent the
> kernel's RPC services from getting unset by malicious users.  If
> /var/run/rpcbind.sock isn't there, the kernel would know immediately
> that rpcbind wasn't running.

So what? You can achieve the same with any number of communication
channels (including the network).  Just add a timeout to the current
'connect()' function, and set it to a low value when doing rpcbind
upcalls.

What's so special about libtirpc or rpcbind that we have to keep
redesigning the kernel to work around their limitations instead of the
other way round?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 25+ messages in thread
[parent not found: <1246900456.11267.34.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  [not found]                                 ` <1246900456.11267.34.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-07-06 17:51                                   ` Chuck Lever
  2009-07-06 17:58                                     ` Trond Myklebust
  0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2009-07-06 17:51 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Suresh Jayaraman, Neil Brown, Linux NFS mailing list

On Jul 6, 2009, at 1:14 PM, Trond Myklebust wrote:
> On Mon, 2009-07-06 at 12:57 -0400, Chuck Lever wrote:
>> On Jul 6, 2009, at 12:40 PM, Trond Myklebust wrote:
>>> On Mon, 2009-07-06 at 12:31 -0400, Chuck Lever wrote:
>>>> I have considered that.  AF_LOCAL in fact could replace all of our
>>>> upcall mechanisms.  However, portmapper, which doesn't support
>>>> AF_LOCAL, is still used in some distributions.
>>>
>>> As could AF_NETLINK, fork(), pipes, fifos, etc...  Again: why would
>>> we want to saddle ourselves with RPC over AF_LOCAL?
>>
>> TI-RPC supports AF_LOCAL RPC transports.
>>
>> [cel@matisse notify-one]$ rpcinfo
>>    program version netid     address                service    owner
>>     100000    4    tcp6      ::.0.111               portmapper superuser
>>     100000    3    tcp6      ::.0.111               portmapper superuser
>>     100000    4    udp6      ::.0.111               portmapper superuser
>>     100000    3    udp6      ::.0.111               portmapper superuser
>>     100000    4    tcp       0.0.0.0.0.111          portmapper superuser
>>     100000    3    tcp       0.0.0.0.0.111          portmapper superuser
>>     100000    2    tcp       0.0.0.0.0.111          portmapper superuser
>>     100000    4    udp       0.0.0.0.0.111          portmapper superuser
>>     100000    3    udp       0.0.0.0.0.111          portmapper superuser
>>     100000    2    udp       0.0.0.0.0.111          portmapper superuser
>>     100000    4    local     /var/run/rpcbind.sock  portmapper superuser
>>     100000    3    local     /var/run/rpcbind.sock  portmapper superuser
>>     100024    1    udp       0.0.0.0.206.127        status     29
>>     100024    1    tcp       0.0.0.0.166.105        status     29
>>     100024    1    udp6      ::.141.238             status     29
>>     100024    1    tcp6      ::.192.160             status     29
>> [cel@matisse notify-one]$
>>
>> The listing for '/var/run/rpcbind.sock' is rpcbind's AF_LOCAL
>> listener.  TI-RPC's rpcb_foo() calls use this method of accessing
>> the rpcbind database rather than going over loopback.
>>
>> rpcbind scrapes the caller's effective UID off the transport socket
>> and uses that for authentication.  Note the "owner" column... that
>> comes from the socket's UID, not from the r_owner field.  When a
>> service is registered over the network, the owner column says
>> "unknown" and basically anyone can unset it.
>>
>> If the kernel used AF_LOCAL to register its services, it would mean
>> we would never use a network port for local rpcbind calls between
>> the kernel and rpcbind, and rpcbind could automatically prevent the
>> kernel's RPC services from getting unset by malicious users.  If
>> /var/run/rpcbind.sock isn't there, the kernel would know immediately
>> that rpcbind wasn't running.
>
> So what? You can achieve the same with any number of communication
> channels (including the network).  Just add a timeout to the current
> 'connect()' function, and set it to a low value when doing rpcbind
> upcalls.

I suggested such a scheme last year when we first discussed connected
UDP, and it was decided that especially short timeouts for local
rpcbind calls were not appropriate.

In general, however, the network layer does tell us immediately when
the service is not running (ICMP port unreachable or RST).  The
kernel's RPC client is basically ignoring that information.

> What's so special about libtirpc or rpcbind that we have to keep
> redesigning the kernel to work around their limitations instead of
> the other way round?

I'm not sure what you're referring to, specifically.  However, since
rpcbind is a standard network protocol, the kernel really does have to
talk the protocol correctly if we want to interoperate with non-Linux
implementations.  For local-only cases, we need to ensure that the
kernel is backwards compatible with portmapper.

In this case, Suresh and Neil are dealing with a problem that occurs
whether rpcbind or portmapper is running -- basically during shutdown,
if user space has killed those processes, the kernel waits for a bit
instead of deciding immediately that it should exit.  Nothing to do
with TI-RPC, though TI-RPC does offer a potential solution (AF_LOCAL).

In the mount.nfs case, user space uses RST/port unreachable
specifically for determining when the server does not support a
particular transport (see nfs_probe_port).  That code is actually
baked into the mount command; it's not part of the library.  If we
want to see version/transport negotiation in the kernel, then the
kernel rpcbind client has to have the ability to detect quickly when
the remote does not support the requested transport.  Again, nothing
to do with TI-RPC.

In both cases, it turns out that the library implementations in user
space already fail quickly.  RPC_CANTRECV is returned if an attempt is
made to send an rpcbind query to an inactive UDP port.
RPC_SYSTEMERROR/ECONNREFUSED is returned if an attempt is made to send
an rpcbind query to an inactive TCP port.  In my view, the kernel is
lacking here, and should be made to emulate user space more closely.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  2009-07-06 17:51                                   ` Chuck Lever
@ 2009-07-06 17:58                                     ` Trond Myklebust
  [not found]                                             ` <1246903105.23966.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Trond Myklebust @ 2009-07-06 17:58 UTC (permalink / raw)
To: Chuck Lever; +Cc: Suresh Jayaraman, Neil Brown, Linux NFS mailing list

On Mon, 2009-07-06 at 13:51 -0400, Chuck Lever wrote:
> In both cases, it turns out that the library implementations in user
> space already fail quickly.  RPC_CANTRECV is returned if an attempt
> is made to send an rpcbind query to an inactive UDP port.
> RPC_SYSTEMERROR/ECONNREFUSED is returned if an attempt is made to
> send an rpcbind query to an inactive TCP port.  In my view, the
> kernel is lacking here, and should be made to emulate user space more
> closely.

I fully agree that we can fix the above by more judicious
interpretation of the returned networking errors.  What I don't see
eye to eye with is the assertion that has been floating about that the
current behaviour should be a good enough reason to add kernel
AF_LOCAL support.

Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 25+ messages in thread
[parent not found: <1246903105.23966.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>]
* Re: [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations.
  [not found]                                             ` <1246903105.23966.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2009-07-06 18:32                                               ` Chuck Lever
  0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2009-07-06 18:32 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Suresh Jayaraman, Neil Brown, Linux NFS mailing list

On Jul 6, 2009, at 1:58 PM, Trond Myklebust wrote:
> On Mon, 2009-07-06 at 13:51 -0400, Chuck Lever wrote:
>> In both cases, it turns out that the library implementations in user
>> space already fail quickly.  RPC_CANTRECV is returned if an attempt
>> is made to send an rpcbind query to an inactive UDP port.
>> RPC_SYSTEMERROR/ECONNREFUSED is returned if an attempt is made to
>> send an rpcbind query to an inactive TCP port.  In my view, the
>> kernel is lacking here, and should be made to emulate user space
>> more closely.
>
> I fully agree that we can fix the above by more judicious
> interpretation of the returned networking errors.  What I don't see
> eye to eye with is the assertion that has been floating about that
> the current behaviour should be a good enough reason to add kernel
> AF_LOCAL support.

Since portmapper doesn't support it, and the kernel (for now) needs
backwards compatibility with portmapper, AF_LOCAL for rpcbind is
rather a non-starter.  In my opinion there are other reasons to
consider kernel-level AF_LOCAL support at some later point.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 25+ messages in thread
end of thread, other threads:[~2009-07-06 18:32 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-28 6:33 [nfsd PATCH 0/3] address issues with shutdown while portmap is dead NeilBrown
[not found] ` <20090528062730.15937.70579.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-05-28 6:33 ` [PATCH 1/3] nfsd: don't take nfsd_mutex twice when setting number of threads NeilBrown
[not found] ` <20090528063303.15937.55202.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-06-11 17:52 ` Jeff Layton
[not found] ` <20090611135255.0fa2f728-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org>
2009-06-11 18:01 ` Jeff Layton
2009-05-28 6:33 ` [PATCH 2/3] nfsd: optimise the starting of zero threads when none are running NeilBrown
[not found] ` <20090528063303.15937.57966.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-06-11 17:26 ` Jeff Layton
2009-05-28 6:33 ` [PATCH 3/3] sunrpc: reduce timeout when unregistering rpcbind registrations NeilBrown
[not found] ` <20090528063303.15937.62423.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-05-28 13:07 ` Tom Talpey
2009-06-11 4:49 ` Neil Brown
[not found] ` <18992.36038.267957.467326-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-06-11 15:02 ` Chuck Lever
2009-05-28 13:43 ` Chuck Lever
2009-06-11 4:48 ` Neil Brown
[not found] ` <18992.35996.986951.556723-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2009-06-11 15:44 ` Chuck Lever
2009-07-02 20:04 ` Chuck Lever
2009-07-06 12:42 ` Suresh Jayaraman
2009-07-06 14:30 ` Chuck Lever
2009-07-06 16:08 ` Suresh Jayaraman
2009-07-06 16:22 ` Trond Myklebust
2009-07-06 16:31 ` Chuck Lever
2009-07-06 16:40 ` Trond Myklebust
[not found] ` <1246898450.11267.12.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-07-06 16:57 ` Chuck Lever
2009-07-06 17:14 ` Trond Myklebust
[not found] ` <1246900456.11267.34.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-07-06 17:51 ` Chuck Lever
2009-07-06 17:58 ` Trond Myklebust
[not found] ` <1246903105.23966.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-07-06 18:32 ` Chuck Lever
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox