* [PATCH] Do not hold clnt_fd_lock mutex during connect
@ 2016-05-18 17:54 Paulo Andrade
2016-05-19 3:43 ` [Libtirpc-devel] " Ian Kent
2016-05-19 5:19 ` Ian Kent
0 siblings, 2 replies; 4+ messages in thread
From: Paulo Andrade @ 2016-05-18 17:54 UTC (permalink / raw)
To: libtirpc-devel; +Cc: linux-nfs, Paulo Andrade
A user reported that their application connects to multiple servers
through an RPC interface using libtirpc. When one of the servers
misbehaves (goes down ungracefully, or its traffic stalls for a few
seconds), the client's traffic to the other servers drops as well,
decreasing or going to zero on all servers.

Closer investigation of libtirpc's behavior at the time of the issue
showed that all of the application threads interacting with libtirpc
were blocked on a single lock inside the library; this effective
deadlock explains the dip/stoppage of traffic.

As an experiment, the user removed libtirpc from the application build
and used the standard glibc RPC code instead. With that change,
everything kept working even while server nodes were misbehaving.
Signed-off-by: Paulo Andrade <pcpa@gnu.org>
---
src/clnt_vc.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/src/clnt_vc.c b/src/clnt_vc.c
index a72f9f7..2396f34 100644
--- a/src/clnt_vc.c
+++ b/src/clnt_vc.c
@@ -229,27 +229,23 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
} else
assert(vc_cv != (cond_t *) NULL);
- /*
- * XXX - fvdl connecting while holding a mutex?
- */
+ mutex_unlock(&clnt_fd_lock);
+
slen = sizeof ss;
if (getpeername(fd, (struct sockaddr *)&ss, &slen) < 0) {
if (errno != ENOTCONN) {
rpc_createerr.cf_stat = RPC_SYSTEMERROR;
rpc_createerr.cf_error.re_errno = errno;
- mutex_unlock(&clnt_fd_lock);
thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
goto err;
}
if (connect(fd, (struct sockaddr *)raddr->buf, raddr->len) < 0){
rpc_createerr.cf_stat = RPC_SYSTEMERROR;
rpc_createerr.cf_error.re_errno = errno;
- mutex_unlock(&clnt_fd_lock);
thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
goto err;
}
}
- mutex_unlock(&clnt_fd_lock);
if (!__rpc_fd2sockinfo(fd, &si))
goto err;
thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
--
1.8.3.1
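[Editorial note: the pattern the patch moves to, holding clnt_fd_lock only while shared state is touched and releasing it before the potentially long-blocking connect(2), can be sketched as a standalone fragment. This is illustrative, not the libtirpc code itself; `connect_unlocked` and the bookkeeping comment are hypothetical.]

```c
/* Sketch of the locking pattern the patch adopts: take the mutex only
 * for shared-state setup, release it before the blocking connect(2). */
#include <pthread.h>
#include <sys/socket.h>
#include <errno.h>

static pthread_mutex_t clnt_fd_lock = PTHREAD_MUTEX_INITIALIZER;

int connect_unlocked(int fd, const struct sockaddr *sa, socklen_t len)
{
    pthread_mutex_lock(&clnt_fd_lock);
    /* ... per-fd bookkeeping that genuinely needs the lock ... */
    pthread_mutex_unlock(&clnt_fd_lock);

    /* connect(2) may block for seconds on an unresponsive server;
     * doing it outside the lock keeps other threads' RPC traffic moving. */
    if (connect(fd, sa, len) < 0)
        return -errno;   /* errno is per-thread, so this is race-free */
    return 0;
}
```

With the lock dropped first, a stalled server delays only the thread doing the connect, not every thread contending for clnt_fd_lock.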
* Re: [Libtirpc-devel] [PATCH] Do not hold clnt_fd_lock mutex during connect
2016-05-18 17:54 [PATCH] Do not hold clnt_fd_lock mutex during connect Paulo Andrade
@ 2016-05-19 3:43 ` Ian Kent
2016-05-19 5:19 ` Ian Kent
1 sibling, 0 replies; 4+ messages in thread
From: Ian Kent @ 2016-05-19 3:43 UTC (permalink / raw)
To: Paulo Andrade, libtirpc-devel; +Cc: linux-nfs, Paulo Andrade
On Wed, 2016-05-18 at 14:54 -0300, Paulo Andrade wrote:
> A user reported that their application connects to multiple servers
> through an RPC interface using libtirpc. When one of the servers
> misbehaves (goes down ungracefully, or its traffic stalls for a few
> seconds), the client's traffic to the other servers drops as well,
> decreasing or going to zero on all servers.
>
> Closer investigation of libtirpc's behavior at the time of the issue
> showed that all of the application threads interacting with libtirpc
> were blocked on a single lock inside the library; this effective
> deadlock explains the dip/stoppage of traffic.
>
> As an experiment, the user removed libtirpc from the application build
> and used the standard glibc RPC code instead. With that change,
> everything kept working even while server nodes were misbehaving.
I recommend simplifying this.
It should be a concise description of what is wrong and how this patch resolves
it.
The description of the investigation will probably make reading the history
harder when trying to find changes later, so less is more, I think.
>
> Signed-off-by: Paulo Andrade <pcpa@gnu.org>
> ---
> src/clnt_vc.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/src/clnt_vc.c b/src/clnt_vc.c
> index a72f9f7..2396f34 100644
> --- a/src/clnt_vc.c
> +++ b/src/clnt_vc.c
> @@ -229,27 +229,23 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
> } else
> assert(vc_cv != (cond_t *) NULL);
>
> - /*
> - * XXX - fvdl connecting while holding a mutex?
> - */
> + mutex_unlock(&clnt_fd_lock);
> +
> slen = sizeof ss;
> if (getpeername(fd, (struct sockaddr *)&ss, &slen) < 0) {
> if (errno != ENOTCONN) {
> rpc_createerr.cf_stat = RPC_SYSTEMERROR;
> rpc_createerr.cf_error.re_errno = errno;
> - mutex_unlock(&clnt_fd_lock);
> thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
> goto err;
> }
> if (connect(fd, (struct sockaddr *)raddr->buf, raddr->len) < 0){
> rpc_createerr.cf_stat = RPC_SYSTEMERROR;
> rpc_createerr.cf_error.re_errno = errno;
> - mutex_unlock(&clnt_fd_lock);
> thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
> goto err;
> }
> }
> - mutex_unlock(&clnt_fd_lock);
> if (!__rpc_fd2sockinfo(fd, &si))
> goto err;
> thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
We will need to review the code in the other clnt_*_create() functions for this
to be a complete resolution for the problem.
Ian
* Re: [Libtirpc-devel] [PATCH] Do not hold clnt_fd_lock mutex during connect
2016-05-18 17:54 [PATCH] Do not hold clnt_fd_lock mutex during connect Paulo Andrade
2016-05-19 3:43 ` [Libtirpc-devel] " Ian Kent
@ 2016-05-19 5:19 ` Ian Kent
2016-05-19 23:53 ` Ian Kent
1 sibling, 1 reply; 4+ messages in thread
From: Ian Kent @ 2016-05-19 5:19 UTC (permalink / raw)
To: Paulo Andrade, libtirpc-devel; +Cc: linux-nfs, Paulo Andrade
On Wed, 2016-05-18 at 14:54 -0300, Paulo Andrade wrote:
> A user reported that their application connects to multiple servers
> through an RPC interface using libtirpc. When one of the servers
> misbehaves (goes down ungracefully, or its traffic stalls for a few
> seconds), the client's traffic to the other servers drops as well,
> decreasing or going to zero on all servers.
>
> Closer investigation of libtirpc's behavior at the time of the issue
> showed that all of the application threads interacting with libtirpc
> were blocked on a single lock inside the library; this effective
> deadlock explains the dip/stoppage of traffic.
>
> As an experiment, the user removed libtirpc from the application build
> and used the standard glibc RPC code instead. With that change,
> everything kept working even while server nodes were misbehaving.
>
> Signed-off-by: Paulo Andrade <pcpa@gnu.org>
> ---
> src/clnt_vc.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/src/clnt_vc.c b/src/clnt_vc.c
> index a72f9f7..2396f34 100644
> --- a/src/clnt_vc.c
> +++ b/src/clnt_vc.c
> @@ -229,27 +229,23 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
> } else
> assert(vc_cv != (cond_t *) NULL);
>
> - /*
> - * XXX - fvdl connecting while holding a mutex?
> - */
> + mutex_unlock(&clnt_fd_lock);
> +
> slen = sizeof ss;
> if (getpeername(fd, (struct sockaddr *)&ss, &slen) < 0) {
> if (errno != ENOTCONN) {
> rpc_createerr.cf_stat = RPC_SYSTEMERROR;
> rpc_createerr.cf_error.re_errno = errno;
> - mutex_unlock(&clnt_fd_lock);
> thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
> goto err;
> }
Oh, right, the mutex is probably needed to ensure that errno is reliable.
> if (connect(fd, (struct sockaddr *)raddr->buf, raddr->len) < 0){
But this is probably where the caller is blocking, so a small variation of this
patch should achieve the required result.
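[Editorial note: one such variation could time-limit the connect itself, using non-blocking mode plus poll(2) so a dead server cannot stall the caller indefinitely. A sketch; `connect_with_timeout` is a hypothetical helper, not proposed libtirpc API.]

```c
/* Non-blocking connect with a timeout: switch the fd to O_NONBLOCK,
 * start the connect, wait for writability with poll(2), then read the
 * real result from SO_ERROR. */
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <errno.h>

int connect_with_timeout(int fd, const struct sockaddr *sa, socklen_t len,
                         int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    int err = 0;
    socklen_t errlen = sizeof err;
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    if (connect(fd, sa, len) < 0 && errno != EINPROGRESS)
        return -1;                       /* immediate failure */
    if (poll(&pfd, 1, timeout_ms) <= 0) {
        errno = ETIMEDOUT;               /* timed out (or poll error) */
        return -1;
    }
    /* connect finished; fetch its outcome from SO_ERROR */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err) {
        if (err)
            errno = err;
        return -1;
    }
    fcntl(fd, F_SETFL, flags);           /* restore blocking mode */
    return 0;
}
```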
btw, I had a quick look at some of the other code and so far it looks like they
lead to clnt_tp_create() or clnt_dg_create() calls.
clnt_dg_create() is not connection oriented so it doesn't have the same mutex
lock problem.
So this patch might be all that's needed.
Ian
* Re: [Libtirpc-devel] [PATCH] Do not hold clnt_fd_lock mutex during connect
2016-05-19 5:19 ` Ian Kent
@ 2016-05-19 23:53 ` Ian Kent
0 siblings, 0 replies; 4+ messages in thread
From: Ian Kent @ 2016-05-19 23:53 UTC (permalink / raw)
To: Paulo Andrade, libtirpc-devel; +Cc: linux-nfs, Paulo Andrade
On Thu, 2016-05-19 at 13:19 +0800, Ian Kent wrote:
> On Wed, 2016-05-18 at 14:54 -0300, Paulo Andrade wrote:
> > A user reported that their application connects to multiple servers
> > through an RPC interface using libtirpc. When one of the servers
> > misbehaves (goes down ungracefully, or its traffic stalls for a few
> > seconds), the client's traffic to the other servers drops as well,
> > decreasing or going to zero on all servers.
> >
> > Closer investigation of libtirpc's behavior at the time of the issue
> > showed that all of the application threads interacting with libtirpc
> > were blocked on a single lock inside the library; this effective
> > deadlock explains the dip/stoppage of traffic.
> >
> > As an experiment, the user removed libtirpc from the application build
> > and used the standard glibc RPC code instead. With that change,
> > everything kept working even while server nodes were misbehaving.
> >
> > Signed-off-by: Paulo Andrade <pcpa@gnu.org>
> > ---
> > src/clnt_vc.c | 8 ++------
> > 1 file changed, 2 insertions(+), 6 deletions(-)
> >
> > diff --git a/src/clnt_vc.c b/src/clnt_vc.c
> > index a72f9f7..2396f34 100644
> > --- a/src/clnt_vc.c
> > +++ b/src/clnt_vc.c
> > @@ -229,27 +229,23 @@ clnt_vc_create(fd, raddr, prog, vers, sendsz, recvsz)
> > } else
> > assert(vc_cv != (cond_t *) NULL);
> >
> > - /*
> > - * XXX - fvdl connecting while holding a mutex?
> > - */
> > + mutex_unlock(&clnt_fd_lock);
> > +
> > slen = sizeof ss;
> > if (getpeername(fd, (struct sockaddr *)&ss, &slen) < 0) {
> > if (errno != ENOTCONN) {
> > rpc_createerr.cf_stat = RPC_SYSTEMERROR;
> > rpc_createerr.cf_error.re_errno = errno;
> > - mutex_unlock(&clnt_fd_lock);
> > thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
> > goto err;
> > }
>
> Oh, right, the mutex is probably needed to ensure that errno is reliable.
I realized later how dumb this comment was, so I checked.
POSIX threads provide a per-thread errno, so there's no reason I can see to
hold this lock over the getpeername() and connect() calls.
Ian