* [PATCH] md-cluster: Only one thread should request DLM lock
@ 2015-10-22 13:31 rgoldwyn
2015-10-22 13:31 ` [PATCH] md-cluster: Call update_raid_disks() if another node --grow's raid_disks rgoldwyn
2015-10-23 2:11 ` [PATCH] md-cluster: Only one thread should request DLM lock Neil Brown
0 siblings, 2 replies; 6+ messages in thread
From: rgoldwyn @ 2015-10-22 13:31 UTC (permalink / raw)
To: linux-raid, neilb; +Cc: gqjiang, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
If a DLM lock request is already in progress, requesting the same DLM
lock again will result in -EBUSY. Use a mutex to make sure only one
thread calls dlm_lock() on a given resource at a time.
This fixes the -EBUSY error returned from DLM's validate_lock_args().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/md-cluster.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 35ac2e8..9b977a2 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -29,6 +29,7 @@ struct dlm_lock_resource {
void (*bast)(void *arg, int mode); /* blocking AST function pointer*/
struct mddev *mddev; /* pointing back to mddev. */
int mode;
+ struct mutex res_lock;
};
struct suspend_info {
@@ -102,14 +103,19 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
{
int ret = 0;
+ mutex_lock(&res->res_lock);
+
ret = dlm_lock(res->ls, mode, &res->lksb,
res->flags, res->name, strlen(res->name),
0, sync_ast, res, res->bast);
- if (ret)
+ if (ret) {
+ mutex_unlock(&res->res_lock);
return ret;
+ }
wait_for_completion(&res->completion);
if (res->lksb.sb_status == 0)
res->mode = mode;
+ mutex_unlock(&res->res_lock);
return res->lksb.sb_status;
}
@@ -134,6 +140,7 @@ static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
res->mode = DLM_LOCK_IV;
namelen = strlen(name);
res->name = kzalloc(namelen + 1, GFP_KERNEL);
+ mutex_init(&res->res_lock);
if (!res->name) {
pr_err("md-cluster: Unable to allocate resource name for resource %s\n", name);
goto out_err;
--
1.8.5.6
* [PATCH] md-cluster: Call update_raid_disks() if another node --grow's raid_disks
2015-10-22 13:31 [PATCH] md-cluster: Only one thread should request DLM lock rgoldwyn
@ 2015-10-22 13:31 ` rgoldwyn
2015-10-23 2:11 ` [PATCH] md-cluster: Only one thread should request DLM lock Neil Brown
1 sibling, 0 replies; 6+ messages in thread
From: rgoldwyn @ 2015-10-22 13:31 UTC (permalink / raw)
To: linux-raid, neilb; +Cc: gqjiang, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
To incorporate a --grow executed on one node, the other nodes need to
acknowledge the change in the number of disks. Call update_raid_disks()
to update the internal data structures.
update_raid_disks() leads to the call chain check_reshape() ->
md_allow_write() -> md_update_sb(), which results in a deadlock.
md_allow_write() updates the superblock so that memory can be allocated
safely (the allocation might trigger writeback, which might write to
raid1). This is not required for an md array with a bitmap.
In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also, we disable safemode for clustered mode.
mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset) is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
the case of a clustered node in __md_stop_writes().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
drivers/md/md.c | 22 ++++++++++++++++------
drivers/md/raid1.c | 8 +++++---
2 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a71b36f..0c70856 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2230,7 +2230,6 @@ static bool does_sb_need_changing(struct mddev *mddev)
/* Check if any mddev parameters have changed */
if ((mddev->dev_sectors != le64_to_cpu(sb->size)) ||
(mddev->reshape_position != le64_to_cpu(sb->reshape_position)) ||
- (mddev->recovery_cp != le64_to_cpu(sb->resync_offset)) ||
(mddev->layout != le64_to_cpu(sb->layout)) ||
(mddev->raid_disks != le32_to_cpu(sb->raid_disks)) ||
(mddev->chunk_sectors != le32_to_cpu(sb->chunksize)))
@@ -3314,6 +3313,11 @@ safe_delay_store(struct mddev *mddev, const char *cbuf, size_t len)
{
unsigned long msec;
+ if (mddev_is_clustered(mddev)) {
+ pr_info("md: Safemode is disabled for clustered mode\n");
+ return -EINVAL;
+ }
+
if (strict_strtoul_scaled(cbuf, &msec, 3) < 0)
return -EINVAL;
if (msec == 0)
@@ -5224,7 +5228,10 @@ int md_run(struct mddev *mddev)
atomic_set(&mddev->max_corr_read_errors,
MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
mddev->safemode = 0;
- mddev->safemode_delay = (200 * HZ)/1000 +1; /* 200 msec delay */
+ if (mddev_is_clustered(mddev))
+ mddev->safemode_delay = 0;
+ else
+ mddev->safemode_delay = (200 * HZ)/1000 +1; /* 200 msec delay */
mddev->in_sync = 1;
smp_wmb();
spin_lock(&mddev->lock);
@@ -5267,6 +5274,9 @@ static int do_md_run(struct mddev *mddev)
goto out;
}
+ if (mddev_is_clustered(mddev))
+ md_allow_write(mddev);
+
md_wakeup_thread(mddev->thread);
md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
@@ -5363,7 +5373,8 @@ static void __md_stop_writes(struct mddev *mddev)
md_super_wait(mddev);
if (mddev->ro == 0 &&
- (!mddev->in_sync || (mddev->flags & MD_UPDATE_SB_FLAGS))) {
+ ((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
+ (mddev->flags & MD_UPDATE_SB_FLAGS))) {
/* mark array as shutdown cleanly */
mddev->in_sync = 1;
md_update_sb(mddev, 1);
@@ -9007,9 +9018,8 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
}
}
- /* recovery_cp changed */
- if (le64_to_cpu(sb->resync_offset) != mddev->recovery_cp)
- mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
+ if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
+ update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
/* Finally set the event to be up to date */
mddev->events = le64_to_cpu(sb->events);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a2d813c..5f38430 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3044,9 +3044,11 @@ static int raid1_reshape(struct mddev *mddev)
return -EINVAL;
}
- err = md_allow_write(mddev);
- if (err)
- return err;
+ if (!mddev_is_clustered(mddev)) {
+ err = md_allow_write(mddev);
+ if (err)
+ return err;
+ }
raid_disks = mddev->raid_disks + mddev->delta_disks;
--
1.8.5.6
* Re: [PATCH] md-cluster: Only one thread should request DLM lock
2015-10-22 13:31 [PATCH] md-cluster: Only one thread should request DLM lock rgoldwyn
2015-10-22 13:31 ` [PATCH] md-cluster: Call update_raid_disks() if another node --grow's raid_disks rgoldwyn
@ 2015-10-23 2:11 ` Neil Brown
2015-10-23 10:19 ` Goldwyn Rodrigues
1 sibling, 1 reply; 6+ messages in thread
From: Neil Brown @ 2015-10-23 2:11 UTC (permalink / raw)
To: rgoldwyn, linux-raid; +Cc: gqjiang, Goldwyn Rodrigues
rgoldwyn@suse.de writes:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> If a DLM lock is in progress, requesting the same DLM lock will
> result in -EBUSY. Use a mutex to make sure only one thread requests
> for dlm_lock() function at a time.
>
> This will fix the error -EBUSY returned from DLM's
> validate_lock_args().
I can see that we only want one thread calling dlm_lock() with a given
'struct dlm_lock_resource' at a time, otherwise nasty things could
happen.
However, if such a race is possible, then aren't there other possible
complications?
Suppose two threads try to lock the same resource.
Presumably one will try to lock the resource, then the next one (when it
gets the mutex) will discover that it already has the resource, but will
think it has exclusive access - maybe?
Then both threads will eventually try to unlock, and the second one will
unlock something that it doesn't have locked.
I'm not certain, but that doesn't sound entirely safe.
Which resources do we actually see races with?
Thanks,
NeilBrown
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> drivers/md/md-cluster.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 35ac2e8..9b977a2 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -29,6 +29,7 @@ struct dlm_lock_resource {
> void (*bast)(void *arg, int mode); /* blocking AST function pointer*/
> struct mddev *mddev; /* pointing back to mddev. */
> int mode;
> + struct mutex res_lock;
> };
>
> struct suspend_info {
> @@ -102,14 +103,19 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
> {
> int ret = 0;
>
> + mutex_lock(&res->res_lock);
> +
> ret = dlm_lock(res->ls, mode, &res->lksb,
> res->flags, res->name, strlen(res->name),
> 0, sync_ast, res, res->bast);
> - if (ret)
> + if (ret) {
> + mutex_unlock(&res->res_lock);
> return ret;
> + }
> wait_for_completion(&res->completion);
> if (res->lksb.sb_status == 0)
> res->mode = mode;
> + mutex_unlock(&res->res_lock);
> return res->lksb.sb_status;
> }
>
> @@ -134,6 +140,7 @@ static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
> res->mode = DLM_LOCK_IV;
> namelen = strlen(name);
> res->name = kzalloc(namelen + 1, GFP_KERNEL);
> + mutex_init(&res->res_lock);
> if (!res->name) {
> pr_err("md-cluster: Unable to allocate resource name for resource %s\n", name);
> goto out_err;
> --
> 1.8.5.6
>
* Re: [PATCH] md-cluster: Only one thread should request DLM lock
2015-10-23 2:11 ` [PATCH] md-cluster: Only one thread should request DLM lock Neil Brown
@ 2015-10-23 10:19 ` Goldwyn Rodrigues
2015-10-27 20:48 ` Neil Brown
0 siblings, 1 reply; 6+ messages in thread
From: Goldwyn Rodrigues @ 2015-10-23 10:19 UTC (permalink / raw)
To: Neil Brown, linux-raid; +Cc: gqjiang, Goldwyn Rodrigues
On 10/22/2015 09:11 PM, Neil Brown wrote:
> rgoldwyn@suse.de writes:
>
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> If a DLM lock is in progress, requesting the same DLM lock will
>> result in -EBUSY. Use a mutex to make sure only one thread requests
>> for dlm_lock() function at a time.
>>
>> This will fix the error -EBUSY returned from DLM's
>> validate_lock_args().
>
> I can see that we only want one thread calling dlm_lock() with a given
> 'struct dlm_lock_resource' at a time, otherwise nasty things could
> happen.
>
> However if such a race is possible, then aren't there other possibly
> complications.
This is specific to the duration of the dlm_lock() call only, not the
entire lifetime of the resource. If one thread has requested dlm_lock()
and another thread comes in and calls dlm_lock() on the same resource,
we will get -EBUSY on the second one because the lock is already
requested.
Our dlm_unlock_sync() call is also a dlm_lock_sync() call, and
eventually a dlm_lock() call, with the DLM null (NL) lock mode.
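For reference, a minimal sketch of how the two helpers relate in
drivers/md/md-cluster.c, based on the description above (simplified;
the actual source may differ slightly):
	/*
	 * Sketch based on the description above: "unlocking" a resource is
	 * a down-conversion to the DLM null (NL) mode, so it goes through
	 * the same dlm_lock() path as any other request and can hit the
	 * same -EBUSY.
	 */
	static int dlm_unlock_sync(struct dlm_lock_resource *res)
	{
		return dlm_lock_sync(res, DLM_LOCK_NL);
	}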
>
> Suppose two threads try to lock the same resource.
> Presumably one will try to lock the resource, then the next one (when it
> gets the mutex) will discover that it already has the resource, but will
> think it has exclusive access - maybe?
I am not sure if I understand this. DLM locks are supposed to be at the
node level as opposed to thread level.
>
> Then both threads will eventually try to unlock, and the second one will
> unlock something that doesn't have locked.
>
> I'm not certain, but that doesn't sound entirely safe.
>
> Which resources to we actually see races with?
>
This could happen with any resource; I have seen it with ack, message,
and token.
--
Goldwyn
* Re: [PATCH] md-cluster: Only one thread should request DLM lock
2015-10-23 10:19 ` Goldwyn Rodrigues
@ 2015-10-27 20:48 ` Neil Brown
2015-10-27 23:28 ` Goldwyn Rodrigues
0 siblings, 1 reply; 6+ messages in thread
From: Neil Brown @ 2015-10-27 20:48 UTC (permalink / raw)
To: Goldwyn Rodrigues, linux-raid; +Cc: gqjiang, Goldwyn Rodrigues
On Fri, Oct 23 2015, Goldwyn Rodrigues wrote:
> On 10/22/2015 09:11 PM, Neil Brown wrote:
>> rgoldwyn@suse.de writes:
>>
>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>
>>> If a DLM lock is in progress, requesting the same DLM lock will
>>> result in -EBUSY. Use a mutex to make sure only one thread requests
>>> for dlm_lock() function at a time.
>>>
>>> This will fix the error -EBUSY returned from DLM's
>>> validate_lock_args().
>>
>> I can see that we only want one thread calling dlm_lock() with a given
>> 'struct dlm_lock_resource' at a time, otherwise nasty things could
>> happen.
>>
>> However if such a race is possible, then aren't there other possibly
>> complications.
>
> This is specific to the duration of dlm_lock() function only and not the
> entire lifetime of the resource. If one thread has requested dlm_lock()
> and another thread comes in and calls dlm_lock() on the same resource,
> we will get -EBUSY on the second one because the lock is already requested.
>
> Our dlm_unlock_sync() call is also a dlm_lock_sync(), and eventually
> dlm_lock() call, with a NULL lock.
>
>>
>> Suppose two threads try to lock the same resource.
>> Presumably one will try to lock the resource, then the next one (when it
>> gets the mutex) will discover that it already has the resource, but will
>> think it has exclusive access - maybe?
>
> I am not sure if I understand this. DLM locks are supposed to be at the
> node level as opposed to thread level.
I think this is exactly my point. I think we need some extra
thread-level locking.
For example, suppose some thread calls sendmsg(), which takes the token
lock, and then, while that is happening, metadata_update_start() gets
called.
It will try to take the token lock, but as the node already has the
lock, it will succeed trivially. Then two threads on the one node both
think they have the lock, which will almost certainly lead to confusion.
So we need to hold some mutex the entire time that sendmsg() is running,
and need to hold that same mutex when calling metadata_update_start().
Once we have that, there is no need for the mutex you introduced, which
is just held while claiming the lock.
It could be that we can use ->reconfig_mutex for a lot of this.
Certainly we always hold ->reconfig_mutex while performing a metadata
update.
We probably don't want to take it just for ->resync_info_update().
I'm not sure if it would be best to have a per-resource mutex which we
take in dlm_lock_sync() and drop in dlm_unlock_sync(), or if we want the
locking at a higher level.
Probably ->reconfig_mutex is already used where we need higher-level
locking.
So if you change your patch to unlock in dlm_unlock_sync() rather than
at the end of dlm_lock_sync(), then I think it will make sense.
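A rough sketch of that suggestion (hypothetical and untested, written on
top of the patch above; __dlm_lock_sync() is an invented helper name)
might look like this:
	/*
	 * Hypothetical sketch: keep res->res_lock held from a successful
	 * dlm_lock_sync() until the matching dlm_unlock_sync(), so no other
	 * thread on this node can believe it owns the resource in between.
	 * Because dlm_unlock_sync() is itself a down-conversion through
	 * dlm_lock(), it must not retake the mutex; hence the shared
	 * __dlm_lock_sync() helper.
	 */
	static int __dlm_lock_sync(struct dlm_lock_resource *res, int mode)
	{
		int ret;

		ret = dlm_lock(res->ls, mode, &res->lksb,
			       res->flags, res->name, strlen(res->name),
			       0, sync_ast, res, res->bast);
		if (ret)
			return ret;
		wait_for_completion(&res->completion);
		if (res->lksb.sb_status == 0)
			res->mode = mode;
		return res->lksb.sb_status;
	}

	static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
	{
		int ret;

		mutex_lock(&res->res_lock);
		ret = __dlm_lock_sync(res, mode);
		if (ret)
			/* lock not granted: do not leave the mutex held */
			mutex_unlock(&res->res_lock);
		return ret;
	}

	static int dlm_unlock_sync(struct dlm_lock_resource *res)
	{
		int ret = __dlm_lock_sync(res, DLM_LOCK_NL);

		/* drop the mutex taken by the earlier dlm_lock_sync() */
		mutex_unlock(&res->res_lock);
		return ret;
	}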
Thanks,
NeilBrown
* Re: [PATCH] md-cluster: Only one thread should request DLM lock
2015-10-27 20:48 ` Neil Brown
@ 2015-10-27 23:28 ` Goldwyn Rodrigues
0 siblings, 0 replies; 6+ messages in thread
From: Goldwyn Rodrigues @ 2015-10-27 23:28 UTC (permalink / raw)
To: Neil Brown, linux-raid; +Cc: gqjiang, Goldwyn Rodrigues
On 10/27/2015 03:48 PM, Neil Brown wrote:
> On Fri, Oct 23 2015, Goldwyn Rodrigues wrote:
>
>> On 10/22/2015 09:11 PM, Neil Brown wrote:
>>> rgoldwyn@suse.de writes:
>>>
>>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>>
>>>> If a DLM lock is in progress, requesting the same DLM lock will
>>>> result in -EBUSY. Use a mutex to make sure only one thread requests
>>>> for dlm_lock() function at a time.
>>>>
>>>> This will fix the error -EBUSY returned from DLM's
>>>> validate_lock_args().
>>>
>>> I can see that we only want one thread calling dlm_lock() with a given
>>> 'struct dlm_lock_resource' at a time, otherwise nasty things could
>>> happen.
>>>
>>> However if such a race is possible, then aren't there other possibly
>>> complications.
>>
>> This is specific to the duration of dlm_lock() function only and not the
>> entire lifetime of the resource. If one thread has requested dlm_lock()
>> and another thread comes in and calls dlm_lock() on the same resource,
>> we will get -EBUSY on the second one because the lock is already requested.
>>
>> Our dlm_unlock_sync() call is also a dlm_lock_sync(), and eventually
>> dlm_lock() call, with a NULL lock.
>>
>>>
>>> Suppose two threads try to lock the same resource.
>>> Presumably one will try to lock the resource, then the next one (when it
>>> gets the mutex) will discover that it already has the resource, but will
>>> think it has exclusive access - maybe?
>>
>> I am not sure if I understand this. DLM locks are supposed to be at the
>> node level as opposed to thread level.
>
> I think this is exactly my point. I think we need some extra
> thread-level locking.
> For example suppose some thread calls sendmsg() which takes the token
> lock, and then while that is happening metadata_update_start() gets
> called.
> It will try to take the token lock, but as the node already has the
> lock, it will succeed trivially. Then two threads on the one node both
> think they have the lock which will almost certainly lead to confusion.
Yes, this is the other problem I was talking about, which led to the
call trace originating in unlock_comm(). These are two separate
problems, but your proposal of using a single mutex should resolve
both. I was thinking more in terms of finer-grained locking, but it
looks like overkill here.
>
> So we need to hold some mutex the entire time that sendmsg() is running,
> and need to hold that same mutex when calling metadata_update_start().
> Once we have that, there is not need for the mutex you introduced which
> is just held while claiming the lock.
We may have to add flags to detect where the call is coming from, but
yes, that should be fine. I will come up with a patch soon.
>
> It could be that we can use ->reconfig_mutex for a lot of this.
> Certainly we always hold ->reconfig_mutex while performing a metadata
> update.
> We probably don't want to take it just for ->resync_info_update().
Agree here.
>
> I'm not sure if it would be best to have a per-resource mutex which we
> take in dlm_lock_sync() and drop in dlm_unlock_sync(), or if we want the
> locking at a higher level.
> Probably ->reconfig_mutex is already used where we need higher-level
> locking.
> So if you change you patch to unlock in dlm_unlock_sync() rather than
> at the end of dlm_lock_sync(), then I think it will make sense.
It is just as good as using a single mutex for all communication, so I
would favour using a single one.
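As an illustration only, a single mutex serialising all communication
could look roughly like this (the comm_mutex field is a hypothetical
name, and lock_comm()/unlock_comm()/lock_token() are simplified from
md-cluster.c rather than copied; this is not the eventual patch):
	/*
	 * Illustration only: one mutex in md_cluster_info serialises every
	 * user of the token lock on this node, so sendmsg() and
	 * metadata_update_start() can no longer both believe they hold it.
	 * comm_mutex is a hypothetical field name for this sketch.
	 */
	struct md_cluster_info {
		/* ... existing fields ... */
		struct mutex comm_mutex;
	};

	static int lock_comm(struct md_cluster_info *cinfo)
	{
		int ret;

		mutex_lock(&cinfo->comm_mutex);
		ret = lock_token(cinfo);	/* takes the DLM token lock */
		if (ret)
			mutex_unlock(&cinfo->comm_mutex);
		return ret;
	}

	static void unlock_comm(struct md_cluster_info *cinfo)
	{
		dlm_unlock_sync(cinfo->token_lockres);
		mutex_unlock(&cinfo->comm_mutex);
	}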
Thanks for your comments.
--
Goldwyn