cluster-devel.redhat.com archive mirror
 help / color / mirror / Atom feed
* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3
@ 2007-09-14 14:27 Bob Peterson
  2007-09-14 15:12 ` Steven Whitehouse
  2007-09-19 21:13 ` David Teigland
  0 siblings, 2 replies; 5+ messages in thread
From: Bob Peterson @ 2007-09-14 14:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This is a rewrite of the patch.  We decided it was a better
approach to call separate wrapper functions than trying to work around
the problem with a spin_lock.
--
The problem boiled down to a race between the gdlm_init_threads()
function initializing thread1 and its setting of blist = 1.
Essentially, "if (current == ls->thread1)" was checked by the thread
before the thread creator set ls->thread1.

Since thread1 is the only thread who is allowed to work on the
blocking queue, and since neither thread thought it was thread1, no one
was working on the queue.  So everything just sat.

This patch reuses the ls->async_lock spin_lock to fix the race,
and it fixes the problem.  I've done more than 2000 iterations of the
loop that was recreating the failure and it seems to work.

Dave Teigland brought up the question of whether we should do this
another way.  For example, by checking for the task name "lock_dlm1"
instead.  I'm open to opinions.
--
Signed-off-by: Bob Peterson <rpeterso@redhat.com> 
--
diff -pur a/fs/gfs2/locking/dlm/thread.c b/fs/gfs2/locking/dlm/thread.c
--- a/fs/gfs2/locking/dlm/thread.c	2007-09-13 17:33:58.000000000 -0500
+++ b/fs/gfs2/locking/dlm/thread.c	2007-09-14 09:16:07.000000000 -0500
@@ -268,20 +268,16 @@ static inline int check_drop(struct gdlm
 	return 0;
 }
 
-static int gdlm_thread(void *data)
+static int gdlm_thread(void *data, int blist)
 {
 	struct gdlm_ls *ls = (struct gdlm_ls *) data;
 	struct gdlm_lock *lp = NULL;
-	int blist = 0;
 	uint8_t complete, blocking, submit, drop;
 	DECLARE_WAITQUEUE(wait, current);
 
 	/* Only thread1 is allowed to do blocking callbacks since gfs
 	   may wait for a completion callback within a blocking cb. */
 
-	if (current == ls->thread1)
-		blist = 1;
-
 	while (!kthread_should_stop()) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		add_wait_queue(&ls->thread_wait, &wait);
@@ -333,12 +329,22 @@ static int gdlm_thread(void *data)
 	return 0;
 }
 
+static int gdlm_thread1(void *data)
+{
+	return gdlm_thread(data, 1);
+}
+
+static int gdlm_thread2(void *data)
+{
+	return gdlm_thread(data, 0);
+}
+
 int gdlm_init_threads(struct gdlm_ls *ls)
 {
 	struct task_struct *p;
 	int error;
 
-	p = kthread_run(gdlm_thread, ls, "lock_dlm1");
+	p = kthread_run(gdlm_thread1, ls, "lock_dlm1");
 	error = IS_ERR(p);
 	if (error) {
 		log_error("can't start lock_dlm1 thread %d", error);
@@ -346,7 +352,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
 	}
 	ls->thread1 = p;
 
-	p = kthread_run(gdlm_thread, ls, "lock_dlm2");
+	p = kthread_run(gdlm_thread2, ls, "lock_dlm2");
 	error = IS_ERR(p);
 	if (error) {
 		log_error("can't start lock_dlm2 thread %d", error);




^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3
  2007-09-14 14:27 [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3 Bob Peterson
@ 2007-09-14 15:12 ` Steven Whitehouse
  2007-09-19 21:13 ` David Teigland
  1 sibling, 0 replies; 5+ messages in thread
From: Steven Whitehouse @ 2007-09-14 15:12 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

Now in the -nmw git tree. Thanks,

Steve.

On Fri, 2007-09-14 at 09:27 -0500, Bob Peterson wrote:
> This is a rewrite of the patch.  We decided it was a better
> approach to call separate wrapper functions than trying to work around
> the problem with a spin_lock.
> --
> The problem boiled down to a race between the gdlm_init_threads()
> function initializing thread1 and its setting of blist = 1.
> Essentially, "if (current == ls->thread1)" was checked by the thread
> before the thread creator set ls->thread1.
> 
> Since thread1 is the only thread who is allowed to work on the
> blocking queue, and since neither thread thought it was thread1, no one
> was working on the queue.  So everything just sat.
> 
> This patch reuses the ls->async_lock spin_lock to fix the race,
> and it fixes the problem.  I've done more than 2000 iterations of the
> loop that was recreating the failure and it seems to work.
> 
> Dave Teigland brought up the question of whether we should do this
> another way.  For example, by checking for the task name "lock_dlm1"
> instead.  I'm open to opinions.
> --
> Signed-off-by: Bob Peterson <rpeterso@redhat.com> 
> --
> diff -pur a/fs/gfs2/locking/dlm/thread.c b/fs/gfs2/locking/dlm/thread.c
> --- a/fs/gfs2/locking/dlm/thread.c	2007-09-13 17:33:58.000000000 -0500
> +++ b/fs/gfs2/locking/dlm/thread.c	2007-09-14 09:16:07.000000000 -0500
> @@ -268,20 +268,16 @@ static inline int check_drop(struct gdlm
>  	return 0;
>  }
>  
> -static int gdlm_thread(void *data)
> +static int gdlm_thread(void *data, int blist)
>  {
>  	struct gdlm_ls *ls = (struct gdlm_ls *) data;
>  	struct gdlm_lock *lp = NULL;
> -	int blist = 0;
>  	uint8_t complete, blocking, submit, drop;
>  	DECLARE_WAITQUEUE(wait, current);
>  
>  	/* Only thread1 is allowed to do blocking callbacks since gfs
>  	   may wait for a completion callback within a blocking cb. */
>  
> -	if (current == ls->thread1)
> -		blist = 1;
> -
>  	while (!kthread_should_stop()) {
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		add_wait_queue(&ls->thread_wait, &wait);
> @@ -333,12 +329,22 @@ static int gdlm_thread(void *data)
>  	return 0;
>  }
>  
> +static int gdlm_thread1(void *data)
> +{
> +	return gdlm_thread(data, 1);
> +}
> +
> +static int gdlm_thread2(void *data)
> +{
> +	return gdlm_thread(data, 0);
> +}
> +
>  int gdlm_init_threads(struct gdlm_ls *ls)
>  {
>  	struct task_struct *p;
>  	int error;
>  
> -	p = kthread_run(gdlm_thread, ls, "lock_dlm1");
> +	p = kthread_run(gdlm_thread1, ls, "lock_dlm1");
>  	error = IS_ERR(p);
>  	if (error) {
>  		log_error("can't start lock_dlm1 thread %d", error);
> @@ -346,7 +352,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
>  	}
>  	ls->thread1 = p;
>  
> -	p = kthread_run(gdlm_thread, ls, "lock_dlm2");
> +	p = kthread_run(gdlm_thread2, ls, "lock_dlm2");
>  	error = IS_ERR(p);
>  	if (error) {
>  		log_error("can't start lock_dlm2 thread %d", error);
> 
> 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3
  2007-09-14 14:27 [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3 Bob Peterson
  2007-09-14 15:12 ` Steven Whitehouse
@ 2007-09-19 21:13 ` David Teigland
  2007-09-19 22:23   ` Bob Peterson
  1 sibling, 1 reply; 5+ messages in thread
From: David Teigland @ 2007-09-19 21:13 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Sep 14, 2007 at 09:27:59AM -0500, Bob Peterson wrote:
> This is a rewrite of the patch.  We decided it was a better
> approach to call separate wrapper functions than trying to work around
> the problem with a spin_lock.

ACK, I noticed that this isn't in git yet


> --
> The problem boiled down to a race between the gdlm_init_threads()
> function initializing thread1 and its setting of blist = 1.
> Essentially, "if (current == ls->thread1)" was checked by the thread
> before the thread creator set ls->thread1.
> 
> Since thread1 is the only thread who is allowed to work on the
> blocking queue, and since neither thread thought it was thread1, no one
> was working on the queue.  So everything just sat.
> 
> This patch reuses the ls->async_lock spin_lock to fix the race,
> and it fixes the problem.  I've done more than 2000 iterations of the
> loop that was recreating the failure and it seems to work.
> 
> Dave Teigland brought up the question of whether we should do this
> another way.  For example, by checking for the task name "lock_dlm1"
> instead.  I'm open to opinions.
> --
> Signed-off-by: Bob Peterson <rpeterso@redhat.com> 
> --
> diff -pur a/fs/gfs2/locking/dlm/thread.c b/fs/gfs2/locking/dlm/thread.c
> --- a/fs/gfs2/locking/dlm/thread.c	2007-09-13 17:33:58.000000000 -0500
> +++ b/fs/gfs2/locking/dlm/thread.c	2007-09-14 09:16:07.000000000 -0500
> @@ -268,20 +268,16 @@ static inline int check_drop(struct gdlm
>  	return 0;
>  }
>  
> -static int gdlm_thread(void *data)
> +static int gdlm_thread(void *data, int blist)
>  {
>  	struct gdlm_ls *ls = (struct gdlm_ls *) data;
>  	struct gdlm_lock *lp = NULL;
> -	int blist = 0;
>  	uint8_t complete, blocking, submit, drop;
>  	DECLARE_WAITQUEUE(wait, current);
>  
>  	/* Only thread1 is allowed to do blocking callbacks since gfs
>  	   may wait for a completion callback within a blocking cb. */
>  
> -	if (current == ls->thread1)
> -		blist = 1;
> -
>  	while (!kthread_should_stop()) {
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		add_wait_queue(&ls->thread_wait, &wait);
> @@ -333,12 +329,22 @@ static int gdlm_thread(void *data)
>  	return 0;
>  }
>  
> +static int gdlm_thread1(void *data)
> +{
> +	return gdlm_thread(data, 1);
> +}
> +
> +static int gdlm_thread2(void *data)
> +{
> +	return gdlm_thread(data, 0);
> +}
> +
>  int gdlm_init_threads(struct gdlm_ls *ls)
>  {
>  	struct task_struct *p;
>  	int error;
>  
> -	p = kthread_run(gdlm_thread, ls, "lock_dlm1");
> +	p = kthread_run(gdlm_thread1, ls, "lock_dlm1");
>  	error = IS_ERR(p);
>  	if (error) {
>  		log_error("can't start lock_dlm1 thread %d", error);
> @@ -346,7 +352,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
>  	}
>  	ls->thread1 = p;
>  
> -	p = kthread_run(gdlm_thread, ls, "lock_dlm2");
> +	p = kthread_run(gdlm_thread2, ls, "lock_dlm2");
>  	error = IS_ERR(p);
>  	if (error) {
>  		log_error("can't start lock_dlm2 thread %d", error);
> 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3
  2007-09-19 21:13 ` David Teigland
@ 2007-09-19 22:23   ` Bob Peterson
  2007-09-20 13:35     ` David Teigland
  0 siblings, 1 reply; 5+ messages in thread
From: Bob Peterson @ 2007-09-19 22:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, 2007-09-19 at 16:13 -0500, David Teigland wrote:
> ACK, I noticed that this isn't in git yet

AFAICT, it's in the git tree:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=e76700466a04c244d1d29fe51b282838becd4f2d

Are you sure you were looking at nmw?




^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3
  2007-09-19 22:23   ` Bob Peterson
@ 2007-09-20 13:35     ` David Teigland
  0 siblings, 0 replies; 5+ messages in thread
From: David Teigland @ 2007-09-20 13:35 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Sep 19, 2007 at 05:23:29PM -0500, Bob Peterson wrote:
> On Wed, 2007-09-19 at 16:13 -0500, David Teigland wrote:
> > ACK, I noticed that this isn't in git yet
> 
> AFAICT, it's in the git tree:
> 
> http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=e76700466a04c244d1d29fe51b282838becd4f2d
> 
> Are you sure you were looking at nmw?

I see, the problem was that the subject started with "chmod hung" at which
point I moved to the next.  It would be great if we could pick slightly
better subjects that zero in on the specific bug or fix.

I also just noticed that the description of the patch includes:
"This patch reuses the ls->async_lock spin_lock to fix the race..."
which isn't true any more, of course.



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-09-20 13:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-14 14:27 [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung - TRY 3 Bob Peterson
2007-09-14 15:12 ` Steven Whitehouse
2007-09-19 21:13 ` David Teigland
2007-09-19 22:23   ` Bob Peterson
2007-09-20 13:35     ` David Teigland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).