* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung
From: Bob Peterson @ 2007-09-13 21:05 UTC
To: cluster-devel.redhat.com

The problem boiled down to a race between gdlm_init_threads() setting
ls->thread1 and the newly created thread reading that field to decide
whether to set blist = 1. Essentially, the thread checked
"if (current == ls->thread1)" before its creator had set ls->thread1.

Since thread1 is the only thread that is allowed to work on the
blocking queue, and since neither thread thought it was thread1, no one
was working on the queue. So everything just sat.
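
As a simplified illustration (a sketch of the timing, not the code as
posted), the losing interleaving is roughly:

  gdlm_init_threads()                  gdlm_thread() (the new task)
  ---------------------------------    --------------------------------
  p = kthread_run(gdlm_thread,
                  ls, "lock_dlm1");
                                       if (current == ls->thread1)
                                               blist = 1;
                                       /* ls->thread1 is still NULL,
                                          so blist stays 0 */
  ls->thread1 = p;                     /* too late: the check has run */
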
This patch reuses the ls->async_lock spinlock to close that window,
which fixes the problem. I've run more than 2000 iterations of the
loop that was recreating the failure, and it seems to work.

Dave Teigland raised the question of whether we should do this another
way, for example by checking for the task name "lock_dlm1" instead.
I'm open to opinions.
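
For reference, that alternative would amount to something like this in
gdlm_thread() (an untested sketch; current->comm holds the name that
kthread_run() gave the task):

  if (strcmp(current->comm, "lock_dlm1") == 0)
          blist = 1;

That would avoid taking a lock here, at the cost of tying the logic to
the thread's name string.
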
* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung
From: Bob Peterson @ 2007-09-13 21:46 UTC
To: cluster-devel.redhat.com

The following is a patch for bugzilla bug 276631.

The problem boiled down to a race between gdlm_init_threads() setting
ls->thread1 and the newly created thread reading that field to decide
whether to set blist = 1. Essentially, the thread checked
"if (current == ls->thread1)" before its creator had set ls->thread1.

Since thread1 is the only thread that is allowed to work on the
blocking queue, and since neither thread thought it was thread1, no one
was working on the queue. So everything just sat.

This patch reuses the ls->async_lock spinlock to close that window,
which fixes the problem. I've run more than 2000 iterations of the
loop that was recreating the failure, and it seems to work.

Dave Teigland raised the question of whether we should do this another
way, for example by checking for the task name "lock_dlm1" instead.
I'm open to opinions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
diff -pur a/fs/gfs2/locking/dlm/thread.c b/fs/gfs2/locking/dlm/thread.c
--- a/fs/gfs2/locking/dlm/thread.c	2007-09-13 15:51:08.000000000 -0500
+++ b/fs/gfs2/locking/dlm/thread.c	2007-09-13 15:21:07.000000000 -0500
@@ -279,8 +279,10 @@ static int gdlm_thread(void *data)
 	/* Only thread1 is allowed to do blocking callbacks since gfs
 	   may wait for a completion callback within a blocking cb. */
 
+	spin_lock(&ls->async_lock);
 	if (current == ls->thread1)
 		blist = 1;
+	spin_unlock(&ls->async_lock);
 
 	while (!kthread_should_stop()) {
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -338,6 +340,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
 	struct task_struct *p;
 	int error;
 
+	spin_lock(&ls->async_lock);
 	p = kthread_run(gdlm_thread, ls, "lock_dlm1");
 	error = IS_ERR(p);
 	if (error) {
@@ -354,6 +357,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
 		return error;
 	}
 	ls->thread2 = p;
+	spin_unlock(&ls->async_lock);
 
 	return 0;
 }

* [Cluster-devel] [PATCH] [GFS2] bz 276631 : GFS2: chmod hung
From: Josef Bacik @ 2007-09-13 22:21 UTC
To: cluster-devel.redhat.com
On Thu, Sep 13, 2007 at 04:46:37PM -0500, Bob Peterson wrote:
> The following is a patch for bugzilla bug 276631.
>
> The problem boiled down to a race between gdlm_init_threads() setting
> ls->thread1 and the newly created thread reading that field to decide
> whether to set blist = 1. Essentially, the thread checked
> "if (current == ls->thread1)" before its creator had set ls->thread1.
>
> Since thread1 is the only thread that is allowed to work on the
> blocking queue, and since neither thread thought it was thread1, no one
> was working on the queue. So everything just sat.
>
> This patch reuses the ls->async_lock spinlock to close that window,
> which fixes the problem. I've run more than 2000 iterations of the
> loop that was recreating the failure, and it seems to work.
>
> Dave Teigland raised the question of whether we should do this another
> way, for example by checking for the task name "lock_dlm1" instead.
> I'm open to opinions.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
> diff -pur a/fs/gfs2/locking/dlm/thread.c b/fs/gfs2/locking/dlm/thread.c
> --- a/fs/gfs2/locking/dlm/thread.c	2007-09-13 15:51:08.000000000 -0500
> +++ b/fs/gfs2/locking/dlm/thread.c	2007-09-13 15:21:07.000000000 -0500
> @@ -279,8 +279,10 @@ static int gdlm_thread(void *data)
>  	/* Only thread1 is allowed to do blocking callbacks since gfs
>  	   may wait for a completion callback within a blocking cb. */
>  
> +	spin_lock(&ls->async_lock);
>  	if (current == ls->thread1)
>  		blist = 1;
> +	spin_unlock(&ls->async_lock);
>  
>  	while (!kthread_should_stop()) {
>  		set_current_state(TASK_INTERRUPTIBLE);
> @@ -338,6 +340,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
>  	struct task_struct *p;
>  	int error;
>  
> +	spin_lock(&ls->async_lock);
>  	p = kthread_run(gdlm_thread, ls, "lock_dlm1");
>  	error = IS_ERR(p);
>  	if (error) {
> @@ -354,6 +357,7 @@ int gdlm_init_threads(struct gdlm_ls *ls
>  		return error;
>  	}
>  	ls->thread2 = p;
> +	spin_unlock(&ls->async_lock);
>  
>  	return 0;
>  }
>
If there's an error, we'll return holding the spin lock.
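
Each early return in gdlm_init_threads() would need to drop it first,
along the lines of (a sketch; the existing error handling is elided):

  if (error) {
          spin_unlock(&ls->async_lock);  /* don't return with the lock held */
          ...
          return error;
  }
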
Josef