RE: Race condition in xprt

All of lore.kernel.org
 help / color / mirror / Atom feed

* RE: Race condition in xprt_disconnect
@ 2004-04-05 20:39 Trond Myklebust
  2004-04-06  9:24 ` Olaf Kirch
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2004-04-05 20:39 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: nfs

[-- Attachment #1: Type: text/plain, Size: 1603 bytes --]

        it seems bad things can happen if two RPC processes call
        xprt_disconnect at the same time. I have two bug reports of ppc
        machines oopsing somewhere deep in some SELinux functions
        because inode->i_security is NULL. inode->i_security is
        allocated when the inode is created, and cleared when the inode
        is destroyed.

        The stack looks something like
        xprt_disconnect->sock_release->...->selinux_foofah

        The only way I can see this happening is if sock_release is
        called twice, and indeed we do not seem to protect against this.
        Please take a look at the attached patch. It should prevent
        these oopses, but I'm not entirely sure it's safe.

I initially thought that patch you sent out was correct, however the
more I think about it, the more I'm convinced it isn't sufficient.

The only way I see that we can have duplicate calls to sock_release() is
if we are scheduling xprt_socket_autoclose() or xprt_socket_connect()
more than once.
Normally xprt_lock_write() should protect us against this, and so AFAICS
any duplicates must imply that xprt_lock_write() is getting lost before
xprt_socket_connect() has had a chance to run. Presumably this is
occurring because xprt->snd_task is timing out and then being woken up
from xprt->pending.
Note that in that case we don't actually *want* to schedule a new
connect(), since the problem is not that the networking operation timed
out, but rather that keventd is being too slow...

The following patch should hopefully protect against this problem.

Cheers,
  Trond



[-- Attachment #2: linux-2.6.5-34-sunrpc_disconnect.dif --]
[-- Type: text/plain, Size: 4253 bytes --]

 include/linux/sunrpc/xprt.h |    5 ++-
 net/sunrpc/xprt.c           |   58 ++++++++++++++++++--------------------------
 2 files changed, 27 insertions(+), 36 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.5-33-grace/include/linux/sunrpc/xprt.h linux-2.6.5-34-sunrpc_disconnect/include/linux/sunrpc/xprt.h
--- linux-2.6.5-33-grace/include/linux/sunrpc/xprt.h	2004-04-05 13:24:48.000000000 -0400
+++ linux-2.6.5-34-sunrpc_disconnect/include/linux/sunrpc/xprt.h	2004-04-05 14:49:48.000000000 -0400
@@ -216,8 +216,9 @@ void			xprt_connect(struct rpc_task *);
 int			xprt_clear_backlog(struct rpc_xprt *);
 void			xprt_sock_setbufsize(struct rpc_xprt *);
 
-#define XPRT_CONNECT	0
-#define XPRT_LOCKED	1
+#define XPRT_LOCKED	0
+#define XPRT_CONNECT	1
+#define XPRT_CONNECTING	2
 
 #define xprt_connected(xp)		(test_bit(XPRT_CONNECT, &(xp)->sockstate))
 #define xprt_set_connected(xp)		(set_bit(XPRT_CONNECT, &(xp)->sockstate))
diff -u --recursive --new-file --show-c-function linux-2.6.5-33-grace/net/sunrpc/xprt.c linux-2.6.5-34-sunrpc_disconnect/net/sunrpc/xprt.c
--- linux-2.6.5-33-grace/net/sunrpc/xprt.c	2004-04-05 13:24:58.000000000 -0400
+++ linux-2.6.5-34-sunrpc_disconnect/net/sunrpc/xprt.c	2004-04-05 15:27:42.000000000 -0400
@@ -448,7 +448,10 @@ xprt_init_autodisconnect(unsigned long d
 		goto out_abort;
 	spin_unlock(&xprt->sock_lock);
 	/* Let keventd close the socket */
-	schedule_work(&xprt->task_cleanup);
+	if (test_bit(XPRT_CONNECTING, &xprt->sockstate) != 0)
+		xprt_release_write(xprt, NULL);
+	else
+		schedule_work(&xprt->task_cleanup);
 	return;
 out_abort:
 	spin_unlock(&xprt->sock_lock);
@@ -460,12 +463,8 @@ static void xprt_socket_connect(void *ar
 	struct socket *sock = xprt->sock;
 	int status = -EIO;
 
-	if (xprt->shutdown) {
-		rpc_wake_up_status(&xprt->pending, -EIO);
-		return;
-	}
-	if (!xprt->addr.sin_port)
-		goto out_err;
+	if (xprt->shutdown || xprt->addr.sin_port == 0)
+		goto out;
 
 	/*
 	 * Start by resetting any existing state
@@ -475,12 +474,12 @@ static void xprt_socket_connect(void *ar
 	if (sock == NULL) {
 		/* couldn't create socket or bind to reserved port;
 		 * this is likely a permanent error, so cause an abort */
-		goto out_err;
-		return;
+		goto out;
 	}
 	xprt_bind_socket(xprt, sock);
 	xprt_sock_setbufsize(xprt);
 
+	status = 0;
 	if (!xprt->stream)
 		goto out;
 
@@ -491,29 +490,22 @@ static void xprt_socket_connect(void *ar
 			sizeof(xprt->addr), O_NONBLOCK);
 	dprintk("RPC: %p  connect status %d connected %d sock state %d\n",
 			xprt, -status, xprt_connected(xprt), sock->sk->sk_state);
-	if (status >= 0)
-		goto out;
-	switch (status) {
-		case -EINPROGRESS:
-		case -EALREADY:
-			return;
-		default:
-			goto out_err;
+	if (status < 0) {
+		switch (status) {
+			case -EINPROGRESS:
+			case -EALREADY:
+				goto out_clear;
+		}
 	}
 out:
-	spin_lock_bh(&xprt->sock_lock);
-	if (xprt->snd_task)
-		rpc_wake_up_task(xprt->snd_task);
-	spin_unlock_bh(&xprt->sock_lock);
-	return;
-out_err:
-	spin_lock_bh(&xprt->sock_lock);
-	if (xprt->snd_task) {
-		xprt->snd_task->tk_status = status;
-		rpc_wake_up_task(xprt->snd_task);
-	} else
-		rpc_wake_up_status(&xprt->pending, -ENOTCONN);
-	spin_unlock_bh(&xprt->sock_lock);
+	if (status < 0)
+		rpc_wake_up_status(&xprt->pending, status);
+	else
+		rpc_wake_up(&xprt->pending);
+out_clear:
+	smp_mb__before_clear_bit();
+	clear_bit(XPRT_CONNECTING, &xprt->sockstate);
+	smp_mb__after_clear_bit();
 }
 
 /*
@@ -545,7 +537,8 @@ void xprt_connect(struct rpc_task *task)
 
 	task->tk_timeout = RPC_CONNECT_TIMEOUT;
 	rpc_sleep_on(&xprt->pending, task, xprt_connect_status, NULL);
-	schedule_work(&xprt->sock_connect);
+	if (!test_and_set_bit(XPRT_CONNECTING, &xprt->sockstate))
+		schedule_work(&xprt->sock_connect);
 	return;
  out_write:
 	xprt_release_write(xprt, task);
@@ -1027,9 +1020,6 @@ tcp_state_change(struct sock *sk)
 			xprt->tcp_reclen = 0;
 			xprt->tcp_copied = 0;
 			xprt->tcp_flags = XPRT_COPY_RECM | XPRT_COPY_XID;
-
-			if (xprt->snd_task)
-				rpc_wake_up_task(xprt->snd_task);
 			rpc_wake_up(&xprt->pending);
 		}
 		spin_unlock_bh(&xprt->sock_lock);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-05 20:39 Race condition in xprt_disconnect Trond Myklebust
@ 2004-04-06  9:24 ` Olaf Kirch
  2004-04-06 13:51   ` Trond Myklebust
  2004-04-06 14:11   ` Trond Myklebust
  0 siblings, 2 replies; 12+ messages in thread
From: Olaf Kirch @ 2004-04-06  9:24 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

Hi Trond,

On Mon, Apr 05, 2004 at 04:39:30PM -0400, Trond Myklebust wrote:
> The only way I see that we can have duplicate calls to sock_release() is
> if we are scheduling xprt_socket_autoclose() or xprt_socket_connect()
> more than once.
> Normally xprt_lock_write() should protect us against this, and so AFAICS
> any duplicates must imply that xprt_lock_write() is getting lost before
> xprt_socket_connect() has had a chance to run.

Yes, that's what it looks like. I had another oops now happening
in some selinux function called from tcp_connect. This looks like
two processes in xprt_socket_connect; the first having just opened
the socket and calling connect, and the second one having released
the socket inbetween.

So yes, you're right. My patch is probably not sufficient.

> Presumably this is
> occurring because xprt->snd_task is timing out and then being woken up
> from xprt->pending.

I'm not sure it's a timeout, actually. Looking at the syslog, the NFS
Server closes the connection, and the oops happens one second later.

I've been looking around the code a little more, and here's a
different theory.

 -	server drops connection
 -	tcp_state_change is called and wakes up all tasks with ENOTCONN
 -	rpciod schedules all tasks. The first one gets the write lock,
 	and schedules the xprt_socket_connect worker. Goes to
	sleep on xprt->pending.
 -	xprt_socket_connect runs and calls xprt_close->xprt_disconnect
 	which wakes up all tasks with ENOTCONN. While xprt_socket_connect
	is calling various network functions to create a socket, bind
	and connect it, the following happens on CPU B:
 -	rpciod wakes up, schedules the task that was waiting for the
 	connect. Oops, we already have the send lock (we're snd_task),
	so go ahead and try to reconnect. Thus we schedule
	xprt_socket_connect a second time, which gets run. Note that
	the workqueue stuff resets the pending bit before calling
	xprt_socket_connect. First thing xprt_socket_connect does
	is close xprt->sock.
 -	Back on CPU A, we find the socket we're just trying to
 	connect is gone. Oops.

My patch proposed earlier to change xprt_close to not wake up all tasks
when called from xprt_socket_connect should also prevent this race. The
general case of snd_task waking up before the worker thread has run is
not covered by this, though. But that should only happen due to timeout,
and a 60 second timeout should be sufficient for keventd even on a slow
day.

> Note that in that case we don't actually *want* to schedule a new
> connect(), since the problem is not that the networking operation timed
> out, but rather that keventd is being too slow...

The additional XPRT_PENDING bit may help. It should be sufficient to
set it in xprt_connect and clear it in xprt_connect_status, I think.

Alternatively, it would also help if we canceled the pending work request
in __xprt_release_write. Unfortunately the workqueue api doesn't seem
to support cancellation. Alternatively we could just call
flush_pending_work.

Et ceterum censeo: sunrpc needs a rewrite :)
I'd like to spend some time on it this summer...

Olaf
-- 
Olaf Kirch     |  The Hardware Gods hate me.
okir@suse.de   |
---------------+ 

-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06  9:24 ` Olaf Kirch
@ 2004-04-06 13:51   ` Trond Myklebust
  2004-04-06 14:11   ` Trond Myklebust
  1 sibling, 0 replies; 12+ messages in thread
From: Trond Myklebust @ 2004-04-06 13:51 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: nfs

[-- Attachment #1: Type: text/plain, Size: 494 bytes --]

On Tue, 2004-04-06 at 05:24, Olaf Kirch wrote:
> Hi Trond,
> Et ceterum censeo: sunrpc needs a rewrite :)
> I'd like to spend some time on it this summer...

I've already got some ideas for some short-term changes...

Here's a couple of patches I've been playing around with lately. The
idea is to make rpciod a little more generic, *and* to make it easier to
migrate rpc_tasks onto other threads when we have a need to perform
operations that might otherwise deadlock rpciod.

Cheers,
  Trond

[-- Attachment #2: linux-2.6.5-36-rpc_workqueue.dif --]
[-- Type: text/plain, Size: 29735 bytes --]

 fs/nfs/unlink.c              |    3 
 include/linux/sunrpc/sched.h |   51 +++--
 net/sunrpc/clnt.c            |    4 
 net/sunrpc/sched.c           |  428 ++++++++++++-------------------------------
 net/sunrpc/xprt.c            |    2 
 5 files changed, 157 insertions(+), 331 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.5-NFS_ALL/fs/nfs/unlink.c linux-2.6.5-34-rpc_workqueue/fs/nfs/unlink.c
--- linux-2.6.5-NFS_ALL/fs/nfs/unlink.c	2004-04-02 21:16:33.000000000 -0500
+++ linux-2.6.5-34-rpc_workqueue/fs/nfs/unlink.c	2004-04-02 22:57:29.000000000 -0500
@@ -211,7 +211,6 @@ nfs_complete_unlink(struct dentry *dentr
 	data->count++;
 	nfs_copy_dname(dentry, data);
 	dentry->d_flags &= ~DCACHE_NFSFS_RENAMED;
-	if (data->task.tk_rpcwait == &nfs_delete_queue)
-		rpc_wake_up_task(&data->task);
+	rpc_wake_up_task(&data->task);
 	nfs_put_unlinkdata(data);
 }
diff -u --recursive --new-file --show-c-function linux-2.6.5-NFS_ALL/include/linux/sunrpc/sched.h linux-2.6.5-34-rpc_workqueue/include/linux/sunrpc/sched.h
--- linux-2.6.5-NFS_ALL/include/linux/sunrpc/sched.h	2004-04-02 21:15:51.000000000 -0500
+++ linux-2.6.5-34-rpc_workqueue/include/linux/sunrpc/sched.h	2004-04-03 10:48:27.000000000 -0500
@@ -12,6 +12,7 @@
 #include <linux/timer.h>
 #include <linux/sunrpc/types.h>
 #include <linux/wait.h>
+#include <linux/workqueue.h>
 #include <linux/sunrpc/xdr.h>
 
 /*
@@ -25,11 +26,18 @@ struct rpc_message {
 	struct rpc_cred *	rpc_cred;	/* Credentials */
 };
 
+struct rpc_wait_queue;
+struct rpc_wait {
+	struct list_head	list;		/* wait queue links */
+	struct list_head	links;		/* Links to related tasks */
+	wait_queue_head_t	waitq;		/* sync: sleep on this q */
+	struct rpc_wait_queue *	rpc_waitq;	/* RPC wait queue we're on */
+};
+
 /*
  * This is the RPC task struct
  */
 struct rpc_task {
-	struct list_head	tk_list;	/* wait queue links */
 #ifdef RPC_DEBUG
 	unsigned long		tk_magic;	/* 0xf00baa */
 #endif
@@ -37,7 +45,6 @@ struct rpc_task {
 	struct rpc_clnt *	tk_client;	/* RPC client */
 	struct rpc_rqst *	tk_rqstp;	/* RPC request */
 	int			tk_status;	/* result of last operation */
-	struct rpc_wait_queue *	tk_rpcwait;	/* RPC wait queue we're on */
 
 	/*
 	 * RPC call state
@@ -70,13 +77,18 @@ struct rpc_task {
 	 * you have a pathological interest in kernel oopses.
 	 */
 	struct timer_list	tk_timer;	/* kernel timer */
-	wait_queue_head_t	tk_wait;	/* sync: sleep on this q */
 	unsigned long		tk_timeout;	/* timeout for rpc_sleep() */
 	unsigned short		tk_flags;	/* misc flags */
 	unsigned char		tk_active   : 1;/* Task has been activated */
 	unsigned char		tk_priority : 2;/* Task priority */
 	unsigned long		tk_runstate;	/* Task run status */
-	struct list_head	tk_links;	/* links to related tasks */
+	struct workqueue_struct	*tk_workqueue;	/* Normally rpciod, but could
+						 * be any workqueue
+						 */
+	union {
+		struct work_struct	tk_work;	/* Async task work queue */
+		struct rpc_wait		tk_wait;	/* RPC wait */
+	} u;
 #ifdef RPC_DEBUG
 	unsigned short		tk_pid;		/* debugging aid */
 #endif
@@ -87,11 +99,11 @@ struct rpc_task {
 /* support walking a list of tasks on a wait queue */
 #define	task_for_each(task, pos, head) \
 	list_for_each(pos, head) \
-		if ((task=list_entry(pos, struct rpc_task, tk_list)),1)
+		if ((task=list_entry(pos, struct rpc_task, u.tk_wait.list)),1)
 
 #define	task_for_first(task, head) \
 	if (!list_empty(head) &&  \
-	    ((task=list_entry((head)->next, struct rpc_task, tk_list)),1))
+	    ((task=list_entry((head)->next, struct rpc_task, u.tk_wait.list)),1))
 
 /* .. and walking list of all tasks */
 #define	alltask_for_each(task, pos, head) \
@@ -124,22 +136,24 @@ typedef void			(*rpc_action)(struct rpc_
 #define RPC_DO_CALLBACK(t)	((t)->tk_callback != NULL)
 #define RPC_IS_SOFT(t)		((t)->tk_flags & RPC_TASK_SOFT)
 
-#define RPC_TASK_SLEEPING	0
-#define RPC_TASK_RUNNING	1
-#define RPC_IS_SLEEPING(t)	(test_bit(RPC_TASK_SLEEPING, &(t)->tk_runstate))
-#define RPC_IS_RUNNING(t)	(test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
+#define RPC_TASK_RUNNING	0
+#define RPC_TASK_QUEUED		1
 
+#define RPC_IS_RUNNING(t)	(test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
 #define rpc_set_running(t)	(set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
-#define rpc_clear_running(t)	(clear_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
-
-#define rpc_set_sleeping(t)	(set_bit(RPC_TASK_SLEEPING, &(t)->tk_runstate))
-
-#define rpc_clear_sleeping(t) \
+#define rpc_test_and_set_running(t) \
+				(test_and_set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate))
+#define rpc_clear_running(t)	\
 	do { \
 		smp_mb__before_clear_bit(); \
-		clear_bit(RPC_TASK_SLEEPING, &(t)->tk_runstate); \
+		clear_bit(RPC_TASK_RUNNING, &(t)->tk_runstate); \
 		smp_mb__after_clear_bit(); \
-	} while(0)
+	} while (0)
+
+#define RPC_IS_QUEUED(t)	(test_bit(RPC_TASK_QUEUED, &(t)->tk_runstate))
+#define rpc_set_queued(t)	(set_bit(RPC_TASK_QUEUED, &(t)->tk_runstate))
+#define rpc_test_and_clear_queued(t) \
+		(test_and_clear_bit(RPC_TASK_QUEUED, &(t)->tk_runstate))
 
 /*
  * Task priorities.
@@ -207,13 +221,10 @@ void		rpc_killall_tasks(struct rpc_clnt 
 int		rpc_execute(struct rpc_task *);
 void		rpc_run_child(struct rpc_task *parent, struct rpc_task *child,
 					rpc_action action);
-int		rpc_add_wait_queue(struct rpc_wait_queue *, struct rpc_task *);
-void		rpc_remove_wait_queue(struct rpc_task *);
 void		rpc_init_priority_wait_queue(struct rpc_wait_queue *, const char *);
 void		rpc_init_wait_queue(struct rpc_wait_queue *, const char *);
 void		rpc_sleep_on(struct rpc_wait_queue *, struct rpc_task *,
 					rpc_action action, rpc_action timer);
-void		rpc_add_timer(struct rpc_task *, rpc_action);
 void		rpc_wake_up_task(struct rpc_task *);
 void		rpc_wake_up(struct rpc_wait_queue *);
 struct rpc_task *rpc_wake_up_next(struct rpc_wait_queue *);
diff -u --recursive --new-file --show-c-function linux-2.6.5-NFS_ALL/net/sunrpc/clnt.c linux-2.6.5-34-rpc_workqueue/net/sunrpc/clnt.c
--- linux-2.6.5-NFS_ALL/net/sunrpc/clnt.c	2004-04-02 21:15:17.000000000 -0500
+++ linux-2.6.5-34-rpc_workqueue/net/sunrpc/clnt.c	2004-04-03 11:11:17.000000000 -0500
@@ -351,7 +351,9 @@ int rpc_call_sync(struct rpc_clnt *clnt,
 	rpc_clnt_sigmask(clnt, &oldset);		
 
 	/* Create/initialize a new RPC task */
-	rpc_init_task(task, clnt, NULL, flags);
+	task = rpc_new_task(clnt, NULL, flags);
+	if (task == NULL)
+		return -ENOMEM;
 	rpc_call_setup(task, msg, 0);
 
 	/* Set up the call info struct and execute the task */
diff -u --recursive --new-file --show-c-function linux-2.6.5-NFS_ALL/net/sunrpc/sched.c linux-2.6.5-34-rpc_workqueue/net/sunrpc/sched.c
--- linux-2.6.5-NFS_ALL/net/sunrpc/sched.c	2004-04-02 21:15:30.000000000 -0500
+++ linux-2.6.5-34-rpc_workqueue/net/sunrpc/sched.c	2004-04-03 10:56:01.000000000 -0500
@@ -41,13 +41,7 @@ static mempool_t	*rpc_buffer_mempool;
 
 static void			__rpc_default_timer(struct rpc_task *task);
 static void			rpciod_killall(void);
-
-/*
- * When an asynchronous RPC task is activated within a bottom half
- * handler, or while executing another RPC task, it is put on
- * schedq, and rpciod is woken up.
- */
-static RPC_WAITQ(schedq, "schedq");
+static void			rpc_async_schedule(void *);
 
 /*
  * RPC tasks that create another task (e.g. for contacting the portmapper)
@@ -68,11 +62,9 @@ static LIST_HEAD(all_tasks);
 /*
  * rpciod-related stuff
  */
-static DECLARE_WAIT_QUEUE_HEAD(rpciod_idle);
-static DECLARE_COMPLETION(rpciod_killer);
 static DECLARE_MUTEX(rpciod_sema);
 static unsigned int		rpciod_users;
-static pid_t			rpciod_pid;
+static struct workqueue_struct *rpciod_workqueue;
 static int			rpc_inhibit;
 
 /*
@@ -105,16 +97,13 @@ __rpc_disable_timer(struct rpc_task *tas
  * without calling del_timer_sync(). The latter could cause a
  * deadlock if called while we're holding spinlocks...
  */
-static void
-rpc_run_timer(struct rpc_task *task)
+static void rpc_run_timer(struct rpc_task *task)
 {
 	void (*callback)(struct rpc_task *);
 
-	spin_lock_bh(&rpc_queue_lock);
 	callback = task->tk_timeout_fn;
 	task->tk_timeout_fn = NULL;
-	spin_unlock_bh(&rpc_queue_lock);
-	if (callback) {
+	if (callback && RPC_IS_QUEUED(task)) {
 		dprintk("RPC: %4d running timer\n", task->tk_pid);
 		callback(task);
 	}
@@ -140,17 +129,6 @@ __rpc_add_timer(struct rpc_task *task, r
 }
 
 /*
- * Set up a timer for an already sleeping task.
- */
-void rpc_add_timer(struct rpc_task *task, rpc_action timer)
-{
-	spin_lock_bh(&rpc_queue_lock);
-	if (!RPC_IS_RUNNING(task))
-		__rpc_add_timer(task, timer);
-	spin_unlock_bh(&rpc_queue_lock);
-}
-
-/*
  * Delete any timer for the current task. Because we use del_timer_sync(),
  * this function should never be called while holding rpc_queue_lock.
  */
@@ -169,16 +147,17 @@ static void __rpc_add_wait_queue_priorit
 	struct list_head *q;
 	struct rpc_task *t;
 
+	INIT_LIST_HEAD(&task->u.tk_wait.links);
 	q = &queue->tasks[task->tk_priority];
 	if (unlikely(task->tk_priority > queue->maxpriority))
 		q = &queue->tasks[queue->maxpriority];
-	list_for_each_entry(t, q, tk_list) {
+	list_for_each_entry(t, q, u.tk_wait.list) {
 		if (t->tk_cookie == task->tk_cookie) {
-			list_add_tail(&task->tk_list, &t->tk_links);
+			list_add_tail(&task->u.tk_wait.list, &t->u.tk_wait.links);
 			return;
 		}
 	}
-	list_add_tail(&task->tk_list, q);
+	list_add_tail(&task->u.tk_wait.list, q);
 }
 
 /*
@@ -189,37 +168,21 @@ static void __rpc_add_wait_queue_priorit
  * improve overall performance.
  * Everyone else gets appended to the queue to ensure proper FIFO behavior.
  */
-static int __rpc_add_wait_queue(struct rpc_wait_queue *queue, struct rpc_task *task)
+static void __rpc_add_wait_queue(struct rpc_wait_queue *queue, struct rpc_task *task)
 {
-	if (task->tk_rpcwait == queue)
-		return 0;
+	BUG_ON (RPC_IS_QUEUED(task));
 
-	if (task->tk_rpcwait) {
-		printk(KERN_WARNING "RPC: doubly enqueued task!\n");
-		return -EWOULDBLOCK;
-	}
 	if (RPC_IS_PRIORITY(queue))
 		__rpc_add_wait_queue_priority(queue, task);
 	else if (RPC_IS_SWAPPER(task))
-		list_add(&task->tk_list, &queue->tasks[0]);
+		list_add(&task->u.tk_wait.list, &queue->tasks[0]);
 	else
-		list_add_tail(&task->tk_list, &queue->tasks[0]);
-	task->tk_rpcwait = queue;
+		list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]);
+	task->u.tk_wait.rpc_waitq = queue;
+	rpc_set_queued(task);
 
 	dprintk("RPC: %4d added to queue %p \"%s\"\n",
 				task->tk_pid, queue, rpc_qname(queue));
-
-	return 0;
-}
-
-int rpc_add_wait_queue(struct rpc_wait_queue *q, struct rpc_task *task)
-{
-	int		result;
-
-	spin_lock_bh(&rpc_queue_lock);
-	result = __rpc_add_wait_queue(q, task);
-	spin_unlock_bh(&rpc_queue_lock);
-	return result;
 }
 
 /*
@@ -229,12 +192,12 @@ static void __rpc_remove_wait_queue_prio
 {
 	struct rpc_task *t;
 
-	if (!list_empty(&task->tk_links)) {
-		t = list_entry(task->tk_links.next, struct rpc_task, tk_list);
-		list_move(&t->tk_list, &task->tk_list);
-		list_splice_init(&task->tk_links, &t->tk_links);
+	if (!list_empty(&task->u.tk_wait.links)) {
+		t = list_entry(task->u.tk_wait.links.next, struct rpc_task, u.tk_wait.list);
+		list_move(&t->u.tk_wait.list, &task->u.tk_wait.list);
+		list_splice_init(&task->u.tk_wait.links, &t->u.tk_wait.links);
 	}
-	list_del(&task->tk_list);
+	list_del(&task->u.tk_wait.list);
 }
 
 /*
@@ -243,31 +206,17 @@ static void __rpc_remove_wait_queue_prio
  */
 static void __rpc_remove_wait_queue(struct rpc_task *task)
 {
-	struct rpc_wait_queue *queue = task->tk_rpcwait;
-
-	if (!queue)
-		return;
+	struct rpc_wait_queue *queue;
+	queue = task->u.tk_wait.rpc_waitq;
 
 	if (RPC_IS_PRIORITY(queue))
 		__rpc_remove_wait_queue_priority(task);
 	else
-		list_del(&task->tk_list);
-	task->tk_rpcwait = NULL;
-
+		list_del(&task->u.tk_wait.list);
 	dprintk("RPC: %4d removed from queue %p \"%s\"\n",
 				task->tk_pid, queue, rpc_qname(queue));
 }
 
-void
-rpc_remove_wait_queue(struct rpc_task *task)
-{
-	if (!task->tk_rpcwait)
-		return;
-	spin_lock_bh(&rpc_queue_lock);
-	__rpc_remove_wait_queue(task);
-	spin_unlock_bh(&rpc_queue_lock);
-}
-
 static inline void rpc_set_waitqueue_priority(struct rpc_wait_queue *queue, int priority)
 {
 	queue->priority = priority;
@@ -316,34 +265,27 @@ EXPORT_SYMBOL(rpc_init_wait_queue);
  * Note: If the task is ASYNC, this must be called with 
  * the spinlock held to protect the wait queue operation.
  */
-static inline void
-rpc_make_runnable(struct rpc_task *task)
+static void rpc_make_runnable(struct rpc_task *task)
 {
-	if (task->tk_timeout_fn) {
-		printk(KERN_ERR "RPC: task w/ running timer in rpc_make_runnable!!\n");
+	if (rpc_test_and_set_running(task))
 		return;
-	}
-	rpc_set_running(task);
+	BUG_ON(task->tk_timeout_fn);
 	if (RPC_IS_ASYNC(task)) {
-		if (RPC_IS_SLEEPING(task)) {
-			int status;
-			status = __rpc_add_wait_queue(&schedq, task);
-			if (status < 0) {
-				printk(KERN_WARNING "RPC: failed to add task to queue: error: %d!\n", status);
-				task->tk_status = status;
-				return;
-			}
-			rpc_clear_sleeping(task);
-			wake_up(&rpciod_idle);
+		int status;
+
+		INIT_WORK(&task->u.tk_work, rpc_async_schedule, (void *)task);
+		status = queue_work(task->tk_workqueue, &task->u.tk_work);
+		if (status < 0) {
+			printk(KERN_WARNING "RPC: failed to add task to queue: error: %d!\n", status);
+			task->tk_status = status;
+			return;
 		}
-	} else {
-		rpc_clear_sleeping(task);
-		wake_up(&task->tk_wait);
-	}
+	} else
+		wake_up(&task->u.tk_wait.waitq);
 }
 
 /*
- * Place a newly initialized task on the schedq.
+ * Place a newly initialized task on the workqueue.
  */
 static inline void
 rpc_schedule_run(struct rpc_task *task)
@@ -352,33 +294,18 @@ rpc_schedule_run(struct rpc_task *task)
 	if (RPC_IS_ACTIVATED(task))
 		return;
 	task->tk_active = 1;
-	rpc_set_sleeping(task);
 	rpc_make_runnable(task);
 }
 
 /*
- *	For other people who may need to wake the I/O daemon
- *	but should (for now) know nothing about its innards
- */
-void rpciod_wake_up(void)
-{
-	if(rpciod_pid==0)
-		printk(KERN_ERR "rpciod: wot no daemon?\n");
-	wake_up(&rpciod_idle);
-}
-
-/*
  * Prepare for sleeping on a wait queue.
  * By always appending tasks to the list we ensure FIFO behavior.
  * NB: An RPC task will only receive interrupt-driven events as long
  * as it's on a wait queue.
  */
-static void
-__rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
+static void __rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
 			rpc_action action, rpc_action timer)
 {
-	int status;
-
 	dprintk("RPC: %4d sleep_on(queue \"%s\" time %ld)\n", task->tk_pid,
 				rpc_qname(q), jiffies);
 
@@ -388,24 +315,14 @@ __rpc_sleep_on(struct rpc_wait_queue *q,
 	}
 
 	/* Mark the task as being activated if so needed */
-	if (!RPC_IS_ACTIVATED(task)) {
+	if (!RPC_IS_ACTIVATED(task))
 		task->tk_active = 1;
-		rpc_set_sleeping(task);
-	}
 
-	status = __rpc_add_wait_queue(q, task);
-	if (status) {
-		printk(KERN_WARNING "RPC: failed to add task to queue: error: %d!\n", status);
-		task->tk_status = status;
-	} else {
-		rpc_clear_running(task);
-		if (task->tk_callback) {
-			dprintk(KERN_ERR "RPC: %4d overwrites an active callback\n", task->tk_pid);
-			BUG();
-		}
-		task->tk_callback = action;
-		__rpc_add_timer(task, timer);
-	}
+	__rpc_add_wait_queue(q, task);
+
+	BUG_ON(task->tk_callback != NULL);
+	task->tk_callback = action;
+	__rpc_add_timer(task, timer);
 }
 
 void
@@ -421,13 +338,12 @@ rpc_sleep_on(struct rpc_wait_queue *q, s
 }
 
 /**
- * __rpc_wake_up_task - wake up a single rpc_task
+ * __rpc_do_wake_up_task - wake up a single rpc_task
  * @task: task to be woken up
  *
- * Caller must hold rpc_queue_lock
+ * Caller must hold rpc_queue_lock, and have cleared the task queued flag.
  */
-static void
-__rpc_wake_up_task(struct rpc_task *task)
+static void __rpc_do_wake_up_task(struct rpc_task *task)
 {
 	dprintk("RPC: %4d __rpc_wake_up_task (now %ld inh %d)\n",
 					task->tk_pid, jiffies, rpc_inhibit);
@@ -445,12 +361,9 @@ __rpc_wake_up_task(struct rpc_task *task
 		printk(KERN_ERR "RPC: Inactive task (%p) being woken up!\n", task);
 		return;
 	}
-	if (RPC_IS_RUNNING(task))
-		return;
 
 	__rpc_disable_timer(task);
-	if (task->tk_rpcwait != &schedq)
-		__rpc_remove_wait_queue(task);
+	__rpc_remove_wait_queue(task);
 
 	rpc_make_runnable(task);
 
@@ -458,6 +371,15 @@ __rpc_wake_up_task(struct rpc_task *task
 }
 
 /*
+ * Wake up the specified task
+ */
+static void __rpc_wake_up_task(struct rpc_task *task)
+{
+	if (rpc_test_and_clear_queued(task))
+		__rpc_do_wake_up_task(task);
+}
+
+/*
  * Default timeout handler if none specified by user
  */
 static void
@@ -471,14 +393,13 @@ __rpc_default_timer(struct rpc_task *tas
 /*
  * Wake up the specified task
  */
-void
-rpc_wake_up_task(struct rpc_task *task)
+void rpc_wake_up_task(struct rpc_task *task)
 {
-	if (RPC_IS_RUNNING(task))
-		return;
-	spin_lock_bh(&rpc_queue_lock);
-	__rpc_wake_up_task(task);
-	spin_unlock_bh(&rpc_queue_lock);
+	if (rpc_test_and_clear_queued(task)) {
+		spin_lock_bh(&rpc_queue_lock);
+		__rpc_do_wake_up_task(task);
+		spin_unlock_bh(&rpc_queue_lock);
+	}
 }
 
 /*
@@ -494,11 +415,11 @@ static struct rpc_task * __rpc_wake_up_n
 	 */
 	q = &queue->tasks[queue->priority];
 	if (!list_empty(q)) {
-		task = list_entry(q->next, struct rpc_task, tk_list);
+		task = list_entry(q->next, struct rpc_task, u.tk_wait.list);
 		if (queue->cookie == task->tk_cookie) {
 			if (--queue->nr)
 				goto out;
-			list_move_tail(&task->tk_list, q);
+			list_move_tail(&task->u.tk_wait.list, q);
 		}
 		/*
 		 * Check if we need to switch queues.
@@ -516,7 +437,7 @@ static struct rpc_task * __rpc_wake_up_n
 		else
 			q = q - 1;
 		if (!list_empty(q)) {
-			task = list_entry(q->next, struct rpc_task, tk_list);
+			task = list_entry(q->next, struct rpc_task, u.tk_wait.list);
 			goto new_queue;
 		}
 	} while (q != &queue->tasks[queue->priority]);
@@ -568,7 +489,7 @@ void rpc_wake_up(struct rpc_wait_queue *
 	head = &queue->tasks[queue->maxpriority];
 	for (;;) {
 		while (!list_empty(head)) {
-			task = list_entry(head->next, struct rpc_task, tk_list);
+			task = list_entry(head->next, struct rpc_task, u.tk_wait.list);
 			__rpc_wake_up_task(task);
 		}
 		if (head == &queue->tasks[0])
@@ -594,7 +515,7 @@ void rpc_wake_up_status(struct rpc_wait_
 	head = &queue->tasks[queue->maxpriority];
 	for (;;) {
 		while (!list_empty(head)) {
-			task = list_entry(head->next, struct rpc_task, tk_list);
+			task = list_entry(head->next, struct rpc_task, u.tk_wait.list);
 			task->tk_status = status;
 			__rpc_wake_up_task(task);
 		}
@@ -626,18 +547,14 @@ __rpc_atrun(struct rpc_task *task)
 /*
  * This is the RPC `scheduler' (or rather, the finite state machine).
  */
-static int
-__rpc_execute(struct rpc_task *task)
+static int __rpc_execute(struct rpc_task *task)
 {
 	int		status = 0;
 
 	dprintk("RPC: %4d rpc_execute flgs %x\n",
 				task->tk_pid, task->tk_flags);
 
-	if (!RPC_IS_RUNNING(task)) {
-		printk(KERN_WARNING "RPC: rpc_execute called for sleeping task!!\n");
-		return 0;
-	}
+	BUG_ON(RPC_IS_QUEUED(task));
 
  restarted:
 	while (1) {
@@ -657,7 +574,9 @@ __rpc_execute(struct rpc_task *task)
 			 */
 			save_callback=task->tk_callback;
 			task->tk_callback=NULL;
+			lock_kernel();
 			save_callback(task);
+			unlock_kernel();
 		}
 
 		/*
@@ -665,43 +584,40 @@ __rpc_execute(struct rpc_task *task)
 		 * tk_action may be NULL when the task has been killed
 		 * by someone else.
 		 */
-		if (RPC_IS_RUNNING(task)) {
+		if (!RPC_IS_QUEUED(task)) {
 			/*
 			 * Garbage collection of pending timers...
 			 */
 			rpc_delete_timer(task);
 			if (!task->tk_action)
 				break;
+			lock_kernel();
 			task->tk_action(task);
-			/* micro-optimization to avoid spinlock */
-			if (RPC_IS_RUNNING(task))
-				continue;
+			unlock_kernel();
 		}
 
 		/*
-		 * Check whether task is sleeping.
+		 * Lockless check for whether task is sleeping or not.
 		 */
-		spin_lock_bh(&rpc_queue_lock);
-		if (!RPC_IS_RUNNING(task)) {
-			rpc_set_sleeping(task);
-			if (RPC_IS_ASYNC(task)) {
-				spin_unlock_bh(&rpc_queue_lock);
+		if (!RPC_IS_QUEUED(task))
+			continue;
+		rpc_clear_running(task);
+		if (RPC_IS_ASYNC(task)) {
+			/* Careful! we may have raced... */
+			if (RPC_IS_QUEUED(task))
 				return 0;
-			}
+			if (rpc_test_and_set_running(task))
+				return 0;
+			continue;
 		}
-		spin_unlock_bh(&rpc_queue_lock);
 
-		if (!RPC_IS_SLEEPING(task))
-			continue;
+		init_waitqueue_head(&task->u.tk_wait.waitq);
 		/* sync task: sleep here */
 		dprintk("RPC: %4d sync task going to sleep\n", task->tk_pid);
-		if (current->pid == rpciod_pid)
-			printk(KERN_ERR "RPC: rpciod waiting on sync task!\n");
-
 		if (!task->tk_client->cl_intr) {
-			__wait_event(task->tk_wait, !RPC_IS_SLEEPING(task));
+			__wait_event(task->u.tk_wait.waitq, RPC_IS_RUNNING(task));
 		} else {
-			__wait_event_interruptible(task->tk_wait, !RPC_IS_SLEEPING(task), status);
+			__wait_event_interruptible(task->u.tk_wait.waitq, RPC_IS_RUNNING(task), status);
 			/*
 			 * When a sync task receives a signal, it exits with
 			 * -ERESTARTSYS. In order to catch any callbacks that
@@ -719,7 +635,9 @@ __rpc_execute(struct rpc_task *task)
 	}
 
 	if (task->tk_exit) {
+		lock_kernel();
 		task->tk_exit(task);
+		unlock_kernel();
 		/* If tk_action is non-null, the user wants us to restart */
 		if (task->tk_action) {
 			if (!RPC_ASSASSINATED(task)) {
@@ -738,7 +656,6 @@ __rpc_execute(struct rpc_task *task)
 
 	/* Release all resources associated with the task */
 	rpc_release_task(task);
-
 	return status;
 }
 
@@ -775,36 +692,9 @@ rpc_execute(struct rpc_task *task)
 	return status;
 }
 
-/*
- * This is our own little scheduler for async RPC tasks.
- */
-static void
-__rpc_schedule(void)
+static void rpc_async_schedule(void *arg)
 {
-	struct rpc_task	*task;
-	int		count = 0;
-
-	dprintk("RPC:      rpc_schedule enter\n");
-	while (1) {
-
-		task_for_first(task, &schedq.tasks[0]) {
-			__rpc_remove_wait_queue(task);
-			spin_unlock_bh(&rpc_queue_lock);
-
-			__rpc_execute(task);
-			spin_lock_bh(&rpc_queue_lock);
-		} else {
-			break;
-		}
-
-		if (++count >= 200 || need_resched()) {
-			count = 0;
-			spin_unlock_bh(&rpc_queue_lock);
-			schedule();
-			spin_lock_bh(&rpc_queue_lock);
-		}
-	}
-	dprintk("RPC:      rpc_schedule leave\n");
+	__rpc_execute((struct rpc_task *)arg);
 }
 
 /*
@@ -862,7 +752,6 @@ void rpc_init_task(struct rpc_task *task
 	task->tk_client = clnt;
 	task->tk_flags  = flags;
 	task->tk_exit   = callback;
-	init_waitqueue_head(&task->tk_wait);
 	if (current->uid != current->fsuid || current->gid != current->fsgid)
 		task->tk_flags |= RPC_TASK_SETUID;
 
@@ -873,7 +762,9 @@ void rpc_init_task(struct rpc_task *task
 
 	task->tk_priority = RPC_PRIORITY_NORMAL;
 	task->tk_cookie = (unsigned long)current;
-	INIT_LIST_HEAD(&task->tk_links);
+
+	/* Initialize workqueue for async tasks */
+	task->tk_workqueue = rpciod_workqueue;
 
 	/* Add to global list of all tasks */
 	spin_lock(&rpc_sched_lock);
@@ -961,19 +852,9 @@ rpc_release_task(struct rpc_task *task)
 	list_del(&task->tk_task);
 	spin_unlock(&rpc_sched_lock);
 
-	/* Protect the execution below. */
-	spin_lock_bh(&rpc_queue_lock);
-
-	/* Disable timer to prevent zombie wakeup */
-	__rpc_disable_timer(task);
-
-	/* Remove from any wait queue we're still on */
-	__rpc_remove_wait_queue(task);
-
+	BUG_ON (rpc_test_and_clear_queued(task));
 	task->tk_active = 0;
 
-	spin_unlock_bh(&rpc_queue_lock);
-
 	/* Synchronously delete any running timer */
 	rpc_delete_timer(task);
 
@@ -1089,82 +970,6 @@ rpc_killall_tasks(struct rpc_clnt *clnt)
 
 static DECLARE_MUTEX_LOCKED(rpciod_running);
 
-static inline int
-rpciod_task_pending(void)
-{
-	return !list_empty(&schedq.tasks[0]);
-}
-
-
-/*
- * This is the rpciod kernel thread
- */
-static int
-rpciod(void *ptr)
-{
-	int		rounds = 0;
-
-	lock_kernel();
-	/*
-	 * Let our maker know we're running ...
-	 */
-	rpciod_pid = current->pid;
-	up(&rpciod_running);
-
-	daemonize("rpciod");
-	allow_signal(SIGKILL);
-
-	dprintk("RPC: rpciod starting (pid %d)\n", rpciod_pid);
-	spin_lock_bh(&rpc_queue_lock);
-	while (rpciod_users) {
-		DEFINE_WAIT(wait);
-		if (signalled()) {
-			spin_unlock_bh(&rpc_queue_lock);
-			rpciod_killall();
-			flush_signals(current);
-			spin_lock_bh(&rpc_queue_lock);
-		}
-		__rpc_schedule();
-		if (current->flags & PF_FREEZE) {
-			spin_unlock_bh(&rpc_queue_lock);
-			refrigerator(PF_IOTHREAD);
-			spin_lock_bh(&rpc_queue_lock);
-		}
-
-		if (++rounds >= 64) {	/* safeguard */
-			spin_unlock_bh(&rpc_queue_lock);
-			schedule();
-			rounds = 0;
-			spin_lock_bh(&rpc_queue_lock);
-		}
-
-		dprintk("RPC: rpciod back to sleep\n");
-		prepare_to_wait(&rpciod_idle, &wait, TASK_INTERRUPTIBLE);
-		if (!rpciod_task_pending() && !signalled()) {
-			spin_unlock_bh(&rpc_queue_lock);
-			schedule();
-			rounds = 0;
-			spin_lock_bh(&rpc_queue_lock);
-		}
-		finish_wait(&rpciod_idle, &wait);
-		dprintk("RPC: switch to rpciod\n");
-	}
-	spin_unlock_bh(&rpc_queue_lock);
-
-	dprintk("RPC: rpciod shutdown commences\n");
-	if (!list_empty(&all_tasks)) {
-		printk(KERN_ERR "rpciod: active tasks at shutdown?!\n");
-		rpciod_killall();
-	}
-
-	dprintk("RPC: rpciod exiting\n");
-	unlock_kernel();
-
-	rpciod_pid = 0;
-	complete_and_exit(&rpciod_killer, 0);
-	return 0;
-}
-
 static void
 rpciod_killall(void)
 {
@@ -1173,9 +978,7 @@ rpciod_killall(void)
 	while (!list_empty(&all_tasks)) {
 		clear_thread_flag(TIF_SIGPENDING);
 		rpc_killall_tasks(NULL);
-		spin_lock_bh(&rpc_queue_lock);
-		__rpc_schedule();
-		spin_unlock_bh(&rpc_queue_lock);
+		flush_workqueue(rpciod_workqueue);
 		if (!list_empty(&all_tasks)) {
 			dprintk("rpciod_killall: waiting for tasks to exit\n");
 			yield();
@@ -1193,28 +996,31 @@ rpciod_killall(void)
 int
 rpciod_up(void)
 {
+	struct workqueue_struct *wq;
 	int error = 0;
 
 	down(&rpciod_sema);
-	dprintk("rpciod_up: pid %d, users %d\n", rpciod_pid, rpciod_users);
+	dprintk("rpciod_up: users %d\n", rpciod_users);
 	rpciod_users++;
-	if (rpciod_pid)
+	if (rpciod_workqueue)
 		goto out;
 	/*
 	 * If there's no pid, we should be the first user.
 	 */
 	if (rpciod_users > 1)
-		printk(KERN_WARNING "rpciod_up: no pid, %d users??\n", rpciod_users);
+		printk(KERN_WARNING "rpciod_up: no workqueue, %d users??\n", rpciod_users);
 	/*
 	 * Create the rpciod thread and wait for it to start.
 	 */
-	error = kernel_thread(rpciod, NULL, 0);
-	if (error < 0) {
-		printk(KERN_WARNING "rpciod_up: create thread failed, error=%d\n", error);
+	error = -ENOMEM;
+	wq = create_workqueue("rpciod");
+	if (wq == NULL) {
+		printk(KERN_WARNING "rpciod_up: create workqueue failed, error=%d\n", error);
 		rpciod_users--;
 		goto out;
 	}
-	down(&rpciod_running);
+	rpciod_workqueue = wq;
+	rpc_inhibit = 0;
 	error = 0;
 out:
 	up(&rpciod_sema);
@@ -1225,20 +1031,22 @@ void
 rpciod_down(void)
 {
 	down(&rpciod_sema);
-	dprintk("rpciod_down pid %d sema %d\n", rpciod_pid, rpciod_users);
+	dprintk("rpciod_down sema %d\n", rpciod_users);
 	if (rpciod_users) {
 		if (--rpciod_users)
 			goto out;
 	} else
-		printk(KERN_WARNING "rpciod_down: pid=%d, no users??\n", rpciod_pid);
+		printk(KERN_WARNING "rpciod_down: no users??\n");
 
-	if (!rpciod_pid) {
+	rpc_inhibit = 1;
+	if (!rpciod_workqueue) {
 		dprintk("rpciod_down: Nothing to do!\n");
 		goto out;
 	}
+	rpciod_killall();
 
-	kill_proc(rpciod_pid, SIGKILL, 1);
-	wait_for_completion(&rpciod_killer);
+	destroy_workqueue(rpciod_workqueue);
+	rpciod_workqueue = NULL;
  out:
 	up(&rpciod_sema);
 }
@@ -1256,7 +1064,12 @@ void rpc_show_tasks(void)
 	}
 	printk("-pid- proc flgs status -client- -prog- --rqstp- -timeout "
 		"-rpcwait -action- --exit--\n");
-	alltask_for_each(t, le, &all_tasks)
+	alltask_for_each(t, le, &all_tasks) {
+		const char *rpc_waitq = "none";
+
+		if (RPC_IS_QUEUED(t))
+			rpc_waitq = rpc_qname(t->u.tk_wait.rpc_waitq);
+
 		printk("%05d %04d %04x %06d %8p %6d %8p %08ld %8s %8p %8p\n",
 			t->tk_pid,
 			(t->tk_msg.rpc_proc ? t->tk_msg.rpc_proc->p_proc : -1),
@@ -1264,8 +1077,9 @@ void rpc_show_tasks(void)
 			t->tk_client,
 			(t->tk_client ? t->tk_client->cl_prog : 0),
 			t->tk_rqstp, t->tk_timeout,
-			rpc_qname(t->tk_rpcwait),
+			rpc_waitq,
 			t->tk_action, t->tk_exit);
+	}
 	spin_unlock(&rpc_sched_lock);
 }
 #endif
diff -u --recursive --new-file --show-c-function linux-2.6.5-NFS_ALL/net/sunrpc/xprt.c linux-2.6.5-34-rpc_workqueue/net/sunrpc/xprt.c
--- linux-2.6.5-NFS_ALL/net/sunrpc/xprt.c	2004-04-02 21:15:52.000000000 -0500
+++ linux-2.6.5-34-rpc_workqueue/net/sunrpc/xprt.c	2004-04-02 22:57:29.000000000 -0500
@@ -1078,7 +1078,7 @@ xprt_write_space(struct sock *sk)
 		goto out;
 
 	spin_lock_bh(&xprt->sock_lock);
-	if (xprt->snd_task && xprt->snd_task->tk_rpcwait == &xprt->pending)
+	if (xprt->snd_task)
 		rpc_wake_up_task(xprt->snd_task);
 	spin_unlock_bh(&xprt->sock_lock);
 out:

[-- Attachment #3: linux-2.6.5-37-rpc_queue_lock.dif --]
[-- Type: text/plain, Size: 8364 bytes --]

 include/linux/sunrpc/sched.h |    4 ++
 net/sunrpc/sched.c           |   69 ++++++++++++++++++-------------------------
 2 files changed, 34 insertions(+), 39 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.5-32-workqueue/include/linux/sunrpc/sched.h linux-2.6.5-33-spin/include/linux/sunrpc/sched.h
--- linux-2.6.5-32-workqueue/include/linux/sunrpc/sched.h	2004-03-27 17:27:14.000000000 -0500
+++ linux-2.6.5-33-spin/include/linux/sunrpc/sched.h	2004-03-27 17:37:08.000000000 -0500
@@ -11,6 +11,7 @@
 
 #include <linux/timer.h>
 #include <linux/sunrpc/types.h>
+#include <linux/spinlock.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/sunrpc/xdr.h>
@@ -166,6 +167,7 @@ typedef void			(*rpc_action)(struct rpc_
  * RPC synchronization objects
  */
 struct rpc_wait_queue {
+	spinlock_t		lock;
 	struct list_head	tasks[RPC_NR_PRIORITY];	/* task queue for each priority level */
 	unsigned long		cookie;			/* cookie of last task serviced */
 	unsigned char		maxpriority;		/* maximum priority (0 if queue is not a priority queue) */
@@ -186,6 +188,7 @@ struct rpc_wait_queue {
 
 #ifndef RPC_DEBUG
 # define RPC_WAITQ_INIT(var,qname) { \
+		.lock = SPIN_LOCK_UNLOCKED, \
 		.tasks = { \
 			[0] = LIST_HEAD_INIT(var.tasks[0]), \
 			[1] = LIST_HEAD_INIT(var.tasks[1]), \
@@ -194,6 +197,7 @@ struct rpc_wait_queue {
 	}
 #else
 # define RPC_WAITQ_INIT(var,qname) { \
+		.lock = SPIN_LOCK_UNLOCKED, \
 		.tasks = { \
 			[0] = LIST_HEAD_INIT(var.tasks[0]), \
 			[1] = LIST_HEAD_INIT(var.tasks[1]), \
diff -u --recursive --new-file --show-c-function linux-2.6.5-32-workqueue/net/sunrpc/sched.c linux-2.6.5-33-spin/net/sunrpc/sched.c
--- linux-2.6.5-32-workqueue/net/sunrpc/sched.c	2004-03-27 17:28:31.000000000 -0500
+++ linux-2.6.5-33-spin/net/sunrpc/sched.c	2004-03-27 17:33:24.000000000 -0500
@@ -68,18 +68,13 @@ static struct workqueue_struct *rpciod_w
 static int			rpc_inhibit;
 
 /*
- * Spinlock for wait queues. Access to the latter also has to be
- * interrupt-safe in order to allow timers to wake up sleeping tasks.
- */
-static spinlock_t rpc_queue_lock = SPIN_LOCK_UNLOCKED;
-/*
  * Spinlock for other critical sections of code.
  */
 static spinlock_t rpc_sched_lock = SPIN_LOCK_UNLOCKED;
 
 /*
  * Disable the timer for a given RPC task. Should be called with
- * rpc_queue_lock and bh_disabled in order to avoid races within
+ * queue->lock and bh_disabled in order to avoid races within
  * rpc_run_timer().
  */
 static inline void
@@ -130,7 +125,7 @@ __rpc_add_timer(struct rpc_task *task, r
 
 /*
  * Delete any timer for the current task. Because we use del_timer_sync(),
- * this function should never be called while holding rpc_queue_lock.
+ * this function should never be called while holding queue->lock.
  */
 static inline void
 rpc_delete_timer(struct rpc_task *task)
@@ -239,6 +234,7 @@ static void __rpc_init_priority_wait_que
 {
 	int i;
 
+	spin_lock_init(&queue->lock);
 	for (i = 0; i < ARRAY_SIZE(queue->tasks); i++)
 		INIT_LIST_HEAD(&queue->tasks[i]);
 	queue->maxpriority = maxprio;
@@ -325,23 +321,22 @@ static void __rpc_sleep_on(struct rpc_wa
 	__rpc_add_timer(task, timer);
 }
 
-void
-rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
+void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task,
 				rpc_action action, rpc_action timer)
 {
 	/*
 	 * Protect the queue operations.
 	 */
-	spin_lock_bh(&rpc_queue_lock);
+	spin_lock_bh(&q->lock);
 	__rpc_sleep_on(q, task, action, timer);
-	spin_unlock_bh(&rpc_queue_lock);
+	spin_unlock_bh(&q->lock);
 }
 
 /**
  * __rpc_do_wake_up_task - wake up a single rpc_task
  * @task: task to be woken up
  *
- * Caller must hold rpc_queue_lock, and have cleared the task queued flag.
+ * Caller must hold queue->lock, and have cleared the task queued flag.
  */
 static void __rpc_do_wake_up_task(struct rpc_task *task)
 {
@@ -396,9 +391,11 @@ __rpc_default_timer(struct rpc_task *tas
 void rpc_wake_up_task(struct rpc_task *task)
 {
 	if (rpc_test_and_clear_queued(task)) {
-		spin_lock_bh(&rpc_queue_lock);
+		struct rpc_wait_queue *queue = task->u.tk_wait.rpc_waitq;
+
+		spin_lock_bh(&queue->lock);
 		__rpc_do_wake_up_task(task);
-		spin_unlock_bh(&rpc_queue_lock);
+		spin_unlock_bh(&queue->lock);
 	}
 }
 
@@ -462,14 +459,14 @@ struct rpc_task * rpc_wake_up_next(struc
 	struct rpc_task	*task = NULL;
 
 	dprintk("RPC:      wake_up_next(%p \"%s\")\n", queue, rpc_qname(queue));
-	spin_lock_bh(&rpc_queue_lock);
+	spin_lock_bh(&queue->lock);
 	if (RPC_IS_PRIORITY(queue))
 		task = __rpc_wake_up_next_priority(queue);
 	else {
 		task_for_first(task, &queue->tasks[0])
 			__rpc_wake_up_task(task);
 	}
-	spin_unlock_bh(&rpc_queue_lock);
+	spin_unlock_bh(&queue->lock);
 
 	return task;
 }
@@ -478,14 +475,14 @@ struct rpc_task * rpc_wake_up_next(struc
  * rpc_wake_up - wake up all rpc_tasks
  * @queue: rpc_wait_queue on which the tasks are sleeping
  *
- * Grabs rpc_queue_lock
+ * Grabs queue->lock
  */
 void rpc_wake_up(struct rpc_wait_queue *queue)
 {
 	struct rpc_task *task;
 
 	struct list_head *head;
-	spin_lock_bh(&rpc_queue_lock);
+	spin_lock_bh(&queue->lock);
 	head = &queue->tasks[queue->maxpriority];
 	for (;;) {
 		while (!list_empty(head)) {
@@ -496,7 +493,7 @@ void rpc_wake_up(struct rpc_wait_queue *
 			break;
 		head--;
 	}
-	spin_unlock_bh(&rpc_queue_lock);
+	spin_unlock_bh(&queue->lock);
 }
 
 /**
@@ -504,14 +501,14 @@ void rpc_wake_up(struct rpc_wait_queue *
  * @queue: rpc_wait_queue on which the tasks are sleeping
  * @status: status value to set
  *
- * Grabs rpc_queue_lock
+ * Grabs queue->lock
  */
 void rpc_wake_up_status(struct rpc_wait_queue *queue, int status)
 {
 	struct list_head *head;
 	struct rpc_task *task;
 
-	spin_lock_bh(&rpc_queue_lock);
+	spin_lock_bh(&queue->lock);
 	head = &queue->tasks[queue->maxpriority];
 	for (;;) {
 		while (!list_empty(head)) {
@@ -523,7 +520,7 @@ void rpc_wake_up_status(struct rpc_wait_
 			break;
 		head--;
 	}
-	spin_unlock_bh(&rpc_queue_lock);
+	spin_unlock_bh(&queue->lock);
 }
 
 /*
@@ -830,8 +827,7 @@ cleanup:
 	goto out;
 }
 
-void
-rpc_release_task(struct rpc_task *task)
+void rpc_release_task(struct rpc_task *task)
 {
 	dprintk("RPC: %4d release task\n", task->tk_pid);
 
@@ -881,10 +877,9 @@ rpc_release_task(struct rpc_task *task)
  * queue 'childq'. If so returns a pointer to the parent.
  * Upon failure returns NULL.
  *
- * Caller must hold rpc_queue_lock
+ * Caller must hold childq.lock
  */
-static inline struct rpc_task *
-rpc_find_parent(struct rpc_task *child)
+static inline struct rpc_task *rpc_find_parent(struct rpc_task *child)
 {
 	struct rpc_task	*task, *parent;
 	struct list_head *le;
@@ -897,17 +892,16 @@ rpc_find_parent(struct rpc_task *child)
 	return NULL;
 }
 
-static void
-rpc_child_exit(struct rpc_task *child)
+static void rpc_child_exit(struct rpc_task *child)
 {
 	struct rpc_task	*parent;
 
-	spin_lock_bh(&rpc_queue_lock);
+	spin_lock_bh(&childq.lock);
 	if ((parent = rpc_find_parent(child)) != NULL) {
 		parent->tk_status = child->tk_status;
 		__rpc_wake_up_task(parent);
 	}
-	spin_unlock_bh(&rpc_queue_lock);
+	spin_unlock_bh(&childq.lock);
 }
 
 /*
@@ -930,22 +924,20 @@ fail:
 	return NULL;
 }
 
-void
-rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
+void rpc_run_child(struct rpc_task *task, struct rpc_task *child, rpc_action func)
 {
-	spin_lock_bh(&rpc_queue_lock);
+	spin_lock_bh(&childq.lock);
 	/* N.B. Is it possible for the child to have already finished? */
 	__rpc_sleep_on(&childq, task, func, NULL);
 	rpc_schedule_run(child);
-	spin_unlock_bh(&rpc_queue_lock);
+	spin_unlock_bh(&childq.lock);
 }
 
 /*
  * Kill all tasks for the given client.
  * XXX: kill their descendants as well?
  */
-void
-rpc_killall_tasks(struct rpc_clnt *clnt)
+void rpc_killall_tasks(struct rpc_clnt *clnt)
 {
 	struct rpc_task	*rovr;
 	struct list_head *le;
@@ -967,8 +959,7 @@ rpc_killall_tasks(struct rpc_clnt *clnt)
 
 static DECLARE_MUTEX_LOCKED(rpciod_running);
 
-static void
-rpciod_killall(void)
+static void rpciod_killall(void)
 {
 	unsigned long flags;
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06  9:24 ` Olaf Kirch
  2004-04-06 13:51   ` Trond Myklebust
@ 2004-04-06 14:11   ` Trond Myklebust
  2004-04-06 14:26     ` Olaf Kirch
  1 sibling, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2004-04-06 14:11 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: nfs

On Tue, 2004-04-06 at 05:24, Olaf Kirch wrote:
> > Presumably this is
> > occurring because xprt->snd_task is timing out and then being woken up
> > from xprt->pending.
> 
> I'm not sure it's a timeout, actually. Looking at the syslog, the NFS
> Server closes the connection, and the oops happens one second later.
> 
> I've been looking around the code a little more, and here's a
> different theory.
> 
>  -	server drops connection
>  -	tcp_state_change is called and wakes up all tasks with ENOTCONN
>  -	rpciod schedules all tasks. The first one gets the write lock,
>  	and schedules the xprt_socket_connect worker. Goes to
> 	sleep on xprt->pending.
>  -	xprt_socket_connect runs and calls xprt_close->xprt_disconnect
>  	which wakes up all tasks with ENOTCONN. While xprt_socket_connect
> 	is calling various network functions to create a socket, bind
> 	and connect it, the following happens on CPU B:
>  -	rpciod wakes up, schedules the task that was waiting for the
>  	connect. Oops, we already have the send lock (we're snd_task),
> 	so go ahead and try to reconnect. Thus we schedule
> 	xprt_socket_connect a second time, which gets run. Note that
> 	the workqueue stuff resets the pending bit before calling
> 	xprt_socket_connect. First thing xprt_socket_connect does
> 	is close xprt->sock.
>  -	Back on CPU A, we find the socket we're just trying to
>  	connect is gone. Oops.
> 
> My patch proposed earlier to change xprt_close to not wake up all tasks
> when called from xprt_socket_connect should also prevent this race. The
> general case of snd_task waking up before the worker thread has run is
> not covered by this, though. But that should only happen due to timeout,
> and a 60 second timeout should be sufficient for keventd even on a slow
> day.

Can't tcp_state_change() inject a few more wakeups when we call
xprt_close()?

Cheers,
  Trond


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06 14:11   ` Trond Myklebust
@ 2004-04-06 14:26     ` Olaf Kirch
  2004-04-06 14:36       ` Trond Myklebust
  2004-04-06 15:27       ` Mounting any other os besides RHEW 3 doesn't work Brent M. Clements
  0 siblings, 2 replies; 12+ messages in thread
From: Olaf Kirch @ 2004-04-06 14:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

On Tue, Apr 06, 2004 at 10:11:29AM -0400, Trond Myklebust wrote:
> Can't tcp_state_change() inject a few more wakeups when we call
> xprt_close()?

I don't think I understand. tcp_state_change can't call xprt_close.

Or are you referring to your alternative patch that simply removed
the xprt_disconnect() from xprt_close? Yes, that would cure both
issues, too.

Olaf
-- 
Olaf Kirch     |  The Hardware Gods hate me.
okir@suse.de   |
---------------+ 


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06 14:26     ` Olaf Kirch
@ 2004-04-06 14:36       ` Trond Myklebust
  2004-04-06 14:54         ` Trond Myklebust
  2004-04-06 15:27       ` Mounting any other os besides RHEW 3 doesn't work Brent M. Clements
  1 sibling, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2004-04-06 14:36 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: nfs

On Tue, 2004-04-06 at 10:26, Olaf Kirch wrote:
> On Tue, Apr 06, 2004 at 10:11:29AM -0400, Trond Myklebust wrote:
> > Can't tcp_state_change() inject a few more wakeups when we call
> > xprt_close()?
> 
> I don't think I understand. tcp_state_change can't call xprt_close.

No, but it does currently call xprt_disconnect().

I'm trying to work out whether or not the socket can change state again
after the server sends the FIN (or whatever it is it sends), but before
xprt_sock_connect()/keventd swaps out sk->state_change in its call to
xprt_close().

Cheers,
  Trond


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06 14:36       ` Trond Myklebust
@ 2004-04-06 14:54         ` Trond Myklebust
  2004-04-06 15:23           ` Olaf Kirch
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2004-04-06 14:54 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: nfs

[-- Attachment #1: Type: text/plain, Size: 767 bytes --]

On Tue, 2004-04-06 at 10:36, Trond Myklebust wrote:
> I'm trying to work out whether or not the socket can change state again
> after the server sends the FIN (or whatever it is it sends), but before
> xprt_sock_connect()/keventd swaps out sk->state_change in its call to
> xprt_close().

I don't think we have to care... Here's a slight variation on the old
patch that should take care of that scenario too:

The policy should be that once tcp_state_change() clears the "connected"
bit, and has woken up those requests that were waiting on a reply, it
should just stay out of the way. We do not want it to interfere with the
reconnection process (the task timeouts etc should suffice for error
handling). This patch should ensure that is the case.

Cheers,
  Trond


[-- Attachment #2: linux-2.6.5-34-sunrpc_disconnect.dif --]
[-- Type: text/plain, Size: 4813 bytes --]

 include/linux/sunrpc/xprt.h |    7 +++--
 net/sunrpc/xprt.c           |   61 ++++++++++++++++++--------------------------
 2 files changed, 31 insertions(+), 37 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.5-33-grace/include/linux/sunrpc/xprt.h linux-2.6.5-34-sunrpc_disconnect/include/linux/sunrpc/xprt.h
--- linux-2.6.5-33-grace/include/linux/sunrpc/xprt.h	2004-04-05 18:23:12.000000000 -0400
+++ linux-2.6.5-34-sunrpc_disconnect/include/linux/sunrpc/xprt.h	2004-04-06 10:17:22.000000000 -0400
@@ -216,12 +216,15 @@ void			xprt_connect(struct rpc_task *);
 int			xprt_clear_backlog(struct rpc_xprt *);
 void			xprt_sock_setbufsize(struct rpc_xprt *);
 
-#define XPRT_CONNECT	0
-#define XPRT_LOCKED	1
+#define XPRT_LOCKED	0
+#define XPRT_CONNECT	1
+#define XPRT_CONNECTING	2
 
 #define xprt_connected(xp)		(test_bit(XPRT_CONNECT, &(xp)->sockstate))
 #define xprt_set_connected(xp)		(set_bit(XPRT_CONNECT, &(xp)->sockstate))
 #define xprt_test_and_set_connected(xp)	(test_and_set_bit(XPRT_CONNECT, &(xp)->sockstate))
+#define xprt_test_and_clear_connected(xp) \
+					(test_and_clear_bit(XPRT_CONNECT, &(xp)->sockstate))
 #define xprt_clear_connected(xp)	(clear_bit(XPRT_CONNECT, &(xp)->sockstate))
 
 #endif /* __KERNEL__*/
diff -u --recursive --new-file --show-c-function linux-2.6.5-33-grace/net/sunrpc/xprt.c linux-2.6.5-34-sunrpc_disconnect/net/sunrpc/xprt.c
--- linux-2.6.5-33-grace/net/sunrpc/xprt.c	2004-04-05 18:23:22.000000000 -0400
+++ linux-2.6.5-34-sunrpc_disconnect/net/sunrpc/xprt.c	2004-04-06 10:16:29.000000000 -0400
@@ -448,7 +448,10 @@ xprt_init_autodisconnect(unsigned long d
 		goto out_abort;
 	spin_unlock(&xprt->sock_lock);
 	/* Let keventd close the socket */
-	schedule_work(&xprt->task_cleanup);
+	if (test_bit(XPRT_CONNECTING, &xprt->sockstate) != 0)
+		xprt_release_write(xprt, NULL);
+	else
+		schedule_work(&xprt->task_cleanup);
 	return;
 out_abort:
 	spin_unlock(&xprt->sock_lock);
@@ -460,12 +463,8 @@ static void xprt_socket_connect(void *ar
 	struct socket *sock = xprt->sock;
 	int status = -EIO;
 
-	if (xprt->shutdown) {
-		rpc_wake_up_status(&xprt->pending, -EIO);
-		return;
-	}
-	if (!xprt->addr.sin_port)
-		goto out_err;
+	if (xprt->shutdown || xprt->addr.sin_port == 0)
+		goto out;
 
 	/*
 	 * Start by resetting any existing state
@@ -475,12 +474,12 @@ static void xprt_socket_connect(void *ar
 	if (sock == NULL) {
 		/* couldn't create socket or bind to reserved port;
 		 * this is likely a permanent error, so cause an abort */
-		goto out_err;
-		return;
+		goto out;
 	}
 	xprt_bind_socket(xprt, sock);
 	xprt_sock_setbufsize(xprt);
 
+	status = 0;
 	if (!xprt->stream)
 		goto out;
 
@@ -491,29 +490,22 @@ static void xprt_socket_connect(void *ar
 			sizeof(xprt->addr), O_NONBLOCK);
 	dprintk("RPC: %p  connect status %d connected %d sock state %d\n",
 			xprt, -status, xprt_connected(xprt), sock->sk->sk_state);
-	if (status >= 0)
-		goto out;
-	switch (status) {
-		case -EINPROGRESS:
-		case -EALREADY:
-			return;
-		default:
-			goto out_err;
+	if (status < 0) {
+		switch (status) {
+			case -EINPROGRESS:
+			case -EALREADY:
+				goto out_clear;
+		}
 	}
 out:
-	spin_lock_bh(&xprt->sock_lock);
-	if (xprt->snd_task)
-		rpc_wake_up_task(xprt->snd_task);
-	spin_unlock_bh(&xprt->sock_lock);
-	return;
-out_err:
-	spin_lock_bh(&xprt->sock_lock);
-	if (xprt->snd_task) {
-		xprt->snd_task->tk_status = status;
-		rpc_wake_up_task(xprt->snd_task);
-	} else
-		rpc_wake_up_status(&xprt->pending, -ENOTCONN);
-	spin_unlock_bh(&xprt->sock_lock);
+	if (status < 0)
+		rpc_wake_up_status(&xprt->pending, status);
+	else
+		rpc_wake_up(&xprt->pending);
+out_clear:
+	smp_mb__before_clear_bit();
+	clear_bit(XPRT_CONNECTING, &xprt->sockstate);
+	smp_mb__after_clear_bit();
 }
 
 /*
@@ -545,7 +537,8 @@ void xprt_connect(struct rpc_task *task)
 
 	task->tk_timeout = RPC_CONNECT_TIMEOUT;
 	rpc_sleep_on(&xprt->pending, task, xprt_connect_status, NULL);
-	schedule_work(&xprt->sock_connect);
+	if (!test_and_set_bit(XPRT_CONNECTING, &xprt->sockstate))
+		schedule_work(&xprt->sock_connect);
 	return;
  out_write:
 	xprt_release_write(xprt, task);
@@ -1027,9 +1020,6 @@ tcp_state_change(struct sock *sk)
 			xprt->tcp_reclen = 0;
 			xprt->tcp_copied = 0;
 			xprt->tcp_flags = XPRT_COPY_RECM | XPRT_COPY_XID;
-
-			if (xprt->snd_task)
-				rpc_wake_up_task(xprt->snd_task);
 			rpc_wake_up(&xprt->pending);
 		}
 		spin_unlock_bh(&xprt->sock_lock);
@@ -1038,7 +1028,8 @@ tcp_state_change(struct sock *sk)
 	case TCP_SYN_RECV:
 		break;
 	default:
-		xprt_disconnect(xprt);
+		if (xprt_test_and_clear_connected(xprt))
+			rpc_wake_up_status(&xprt->pending, -ENOTCONN);
 		break;
 	}
  out:

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06 14:54         ` Trond Myklebust
@ 2004-04-06 15:23           ` Olaf Kirch
  2004-04-06 16:02             ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Olaf Kirch @ 2004-04-06 15:23 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

On Tue, Apr 06, 2004 at 10:54:15AM -0400, Trond Myklebust wrote:
> The policy should be that once tcp_state_change() clears the "connected"
> bit, and has woken up those requests that were waiting on a reply, it
> should just stay out of the way. We do not want it to interfere with the
> reconnection process (the task timeouts etc should suffice for error
> handling). This patch should ensure that is the case.

Are you sure this helps? Your patch doesn't remove the xprt_disconnect
call from xprt_close. This way, xprt_socket_connect still wakes
up all tasks with ENOTCONN, triggering the connect worker to be
scheduled a second time.

Olaf
-- 
Olaf Kirch     |  The Hardware Gods hate me.
okir@suse.de   |
---------------+ 


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Mounting any other os besides RHEW 3 doesn't work
  2004-04-06 14:26     ` Olaf Kirch
  2004-04-06 14:36       ` Trond Myklebust
@ 2004-04-06 15:27       ` Brent M. Clements
  2004-04-06 20:05         ` Steve Dickson
  1 sibling, 1 reply; 12+ messages in thread
From: Brent M. Clements @ 2004-04-06 15:27 UTC (permalink / raw)
  To: nfs

We are stuck on a problem that no one seems to be able to answer, even
RedHat.

We noticed this problem when we tried deploying RHEW 3 Update 1 in a
production environment.

The problem can be reproduced in the following test environment:

1 HPUX 11.22 NFS Server
1 RHAW 2.1 IA64 System
2 RHEW 3 Update 1 IA64 Systems.

All systems are setup with no firewall settings.

Here are the scenarios:

Scenario 1:

The RHAW 2.1 system can properly mount and access the HPUX NFS system

Scenario 2:

The RHEW 3 system can properly mount the HPUX system but when you try to
do any access on the mounted nfs share, it hangs and the messages file on
the client state that the NFS server doesn't. We have noticed that the
client DOES get the nfs filehandle.

Scenario 3:

The RHEW 3 System can properly mount the RHAW 2.1(setup as an nfs server)
but when you try to do any access on the mounted nfs share, it hangs and
the messages file on the client states that the NFS server doesn't
respond. We have noticed that the client DOES get the nfs filehandle.

Scenario 4:
The RHEW 3 system can properly mount and access the other RHEW 3(setup as
an nfs server)

This is extremely wierd. it appears that the only thing that a RHEW 3
update 1 system can mount is another RHEW 3 update 1 system.

We have looked it over and every time we mount other types of nfs servers,
we receive a file handle but it looks like it fails in the data
connection.

Any help would be appreciated. We have submitted a bug report to RedHat
concerning this, but have not received any updates in a long time.

Thanks,

Brent Clements
Linux Technology Specialist
Information Technology
Rice University

-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06 15:23           ` Olaf Kirch
@ 2004-04-06 16:02             ` Trond Myklebust
  2004-04-06 16:17               ` Olaf Kirch
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2004-04-06 16:02 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: nfs

On Tue, 2004-04-06 at 11:23, Olaf Kirch wrote:

> Are you sure this helps? Your patch doesn't remove the xprt_disconnect
> call from xprt_close. This way, xprt_socket_connect still wakes
> up all tasks with ENOTCONN, triggering the connect worker to be
> scheduled a second time.

Oh, but I already have the xprt_disconnect() removal in a separate patch
which is applied *before* this one (see the patch I published in
response to yours last week).

Cheers,
   Trond


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: Race condition in xprt_disconnect
  2004-04-06 16:02             ` Trond Myklebust
@ 2004-04-06 16:17               ` Olaf Kirch
  0 siblings, 0 replies; 12+ messages in thread
From: Olaf Kirch @ 2004-04-06 16:17 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

On Tue, Apr 06, 2004 at 12:02:38PM -0400, Trond Myklebust wrote:
> Oh, but I already have the xprt_disconnect() removal in a separate patch
> which is applied *before* this one (see the patch I published in
> response to yours last week).

Ah, alright. I wasn't aware of that. Then you probably don't even
need to XPRT_CONNECTING bit at all. It makes things a little more
robust though.

Olaf
-- 
Olaf Kirch     |  The Hardware Gods hate me.
okir@suse.de   |
---------------+ 


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Mounting any other os besides RHEW 3 doesn't work
  2004-04-06 15:27       ` Mounting any other os besides RHEW 3 doesn't work Brent M. Clements
@ 2004-04-06 20:05         ` Steve Dickson
  0 siblings, 0 replies; 12+ messages in thread
From: Steve Dickson @ 2004-04-06 20:05 UTC (permalink / raw)
  To: Brent M. Clements; +Cc: nfs

Brent M. Clements wrote:

>We are stuck on a problem that no one seems to be able to answer, even
>RedHat.
>
This is the first I've heard of this.... what's the bugzilla number 
associated
with these issues?

SteveD.


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2004-04-06 20:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-04-05 20:39 Race condition in xprt_disconnect Trond Myklebust
2004-04-06  9:24 ` Olaf Kirch
2004-04-06 13:51   ` Trond Myklebust
2004-04-06 14:11   ` Trond Myklebust
2004-04-06 14:26     ` Olaf Kirch
2004-04-06 14:36       ` Trond Myklebust
2004-04-06 14:54         ` Trond Myklebust
2004-04-06 15:23           ` Olaf Kirch
2004-04-06 16:02             ` Trond Myklebust
2004-04-06 16:17               ` Olaf Kirch
2004-04-06 15:27       ` Mounting any other os besides RHEW 3 doesn't work Brent M. Clements
2004-04-06 20:05         ` Steve Dickson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.