dm: bind new table before destroying old

All of lore.kernel.org
 help / color / mirror / Atom feed

* dm: bind new table before destroying old
@ 2009-11-11  1:16 Alasdair G Kergon
  2009-11-11  1:20 ` Alasdair G Kergon
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Alasdair G Kergon @ 2009-11-11  1:16 UTC (permalink / raw)
  To: dm-devel
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

Questions:

  Do all the targets correctly flush or push back everything during a
  suspend (including workqueues)?

  Do all the targets correctly sync to disk all internal state that
  needs to be preserved during a suspend?

In other words, in the case of an already-suspended target, the target
'dtr' functions should only be freeing memory and other resources and
not causing I/O to any of the table's devices.

All targets are supposed to be behave this way already, but please
would you check the targets with which you are familiar anyway?

Alasdair

From: Alasdair G Kergon <agk@redhat.com>

When replacing a mapped device's table during a 'resume', delay the
destruction of the old table until the new one is successfully in place.

This will make it easier for a later patch to transfer internal state
information from the old table to the new one (something we do not currently
support) while giving us more options for reversion if a later part
of the operation fails.

Devices are always in the suspended state during dm_swap_table().
This patch reinforces the requirement that all I/O must have been
flushed from the table targets while in this state (including any in
workqueues).  In the case of 'noflush' suspending, unprocessed
I/O should have been 'pushed back' to the dm core prior to this point,
for resubmission after the new table is in place.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-table.c |    3 +++
 drivers/md/dm.c       |   13 +++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6.32-rc6/drivers/md/dm-table.c
===================================================================
--- linux-2.6.32-rc6.orig/drivers/md/dm-table.c
+++ linux-2.6.32-rc6/drivers/md/dm-table.c
@@ -237,6 +237,9 @@ void dm_table_destroy(struct dm_table *t
 {
 	unsigned int i;

+	if (!t)
+		return;
+
 	while (atomic_read(&t->holders))
 		msleep(1);
 	smp_mb();
Index: linux-2.6.32-rc6/drivers/md/dm.c
===================================================================
--- linux-2.6.32-rc6.orig/drivers/md/dm.c
+++ linux-2.6.32-rc6/drivers/md/dm.c
@@ -2070,7 +2070,10 @@ static int __bind(struct mapped_device *
 	return 0;
 }

-static void __unbind(struct mapped_device *md)
+/*
+ * Returns unbound table for the caller to free.
+ */
+struct dm_table *__unbind(struct mapped_device *md)
 {
 	struct dm_table *map = md->map;
 	unsigned long flags;
@@ -2082,7 +2085,8 @@ static void __unbind(struct mapped_devic
 	write_lock_irqsave(&md->map_lock, flags);
 	md->map = NULL;
 	write_unlock_irqrestore(&md->map_lock, flags);
-	dm_table_destroy(map);
+
+	return map;
 }

 /*
@@ -2175,7 +2179,7 @@ void dm_put(struct mapped_device *md)
 		}
 		dm_sysfs_exit(md);
 		dm_table_put(map);
-		__unbind(md);
+		dm_table_destroy(__unbind(md));
 		free_dev(md);
 	}
 }
@@ -2381,8 +2385,9 @@ int dm_swap_table(struct mapped_device *
 		goto out;
 	}

-	__unbind(md);
+	map = __unbind(md);
 	r = __bind(md, table, &limits);
+	dm_table_destroy(map);

 out:
 	mutex_unlock(&md->suspend_lock);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11  1:16 dm: bind new table before destroying old Alasdair G Kergon
@ 2009-11-11  1:20 ` Alasdair G Kergon
  2009-11-11  2:48   ` dm: keep old table until after resume succeeded Alasdair G Kergon
  2009-11-11  6:56 ` dm: bind new table before destroying old Kiyoshi Ueda
  2009-11-11 13:20 ` Mike Snitzer
  2 siblings, 1 reply; 13+ messages in thread
From: Alasdair G Kergon @ 2009-11-11  1:20 UTC (permalink / raw)
  To: dm-devel
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

And now a version that actually compiles:-)


From: Alasdair G Kergon <agk@redhat.com>

When replacing a mapped device's table during a 'resume', delay the
destruction of the old table until the new one is successfully in place.

This will make it easier for a later patch to transfer internal state
information from the old table to the new one (something we do not currently
support) while giving us more options for reversion if a later part
of the operation fails.

Devices are always in the suspended state during dm_swap_table().
This patch reinforces the requirement that all I/O must have been
flushed from the table targets while in this state (including any in
workqueues).  In the case of 'noflush' suspending, unprocessed
I/O should have been 'pushed back' to the dm core prior to this point,
for resubmission after the new table is in place.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-table.c |    3 +++
 drivers/md/dm.c       |   16 +++++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

Index: linux-2.6.32-rc6/drivers/md/dm-table.c
===================================================================
--- linux-2.6.32-rc6.orig/drivers/md/dm-table.c
+++ linux-2.6.32-rc6/drivers/md/dm-table.c
@@ -237,6 +237,9 @@ void dm_table_destroy(struct dm_table *t
 {
 	unsigned int i;
 
+	if (!t)
+		return;
+
 	while (atomic_read(&t->holders))
 		msleep(1);
 	smp_mb();
Index: linux-2.6.32-rc6/drivers/md/dm.c
===================================================================
--- linux-2.6.32-rc6.orig/drivers/md/dm.c
+++ linux-2.6.32-rc6/drivers/md/dm.c
@@ -2070,19 +2070,23 @@ static int __bind(struct mapped_device *
 	return 0;
 }
 
-static void __unbind(struct mapped_device *md)
+/*
+ * Returns unbound table for the caller to free.
+ */
+static struct dm_table *__unbind(struct mapped_device *md)
 {
 	struct dm_table *map = md->map;
 	unsigned long flags;
 
 	if (!map)
-		return;
+		return NULL;
 
 	dm_table_event_callback(map, NULL, NULL);
 	write_lock_irqsave(&md->map_lock, flags);
 	md->map = NULL;
 	write_unlock_irqrestore(&md->map_lock, flags);
-	dm_table_destroy(map);
+
+	return map;
 }
 
 /*
@@ -2175,7 +2179,7 @@ void dm_put(struct mapped_device *md)
 		}
 		dm_sysfs_exit(md);
 		dm_table_put(map);
-		__unbind(md);
+		dm_table_destroy(__unbind(md));
 		free_dev(md);
 	}
 }
@@ -2361,6 +2365,7 @@ static void dm_rq_barrier_work(struct wo
  */
 int dm_swap_table(struct mapped_device *md, struct dm_table *table)
 {
+	struct dm_table *map;
 	struct queue_limits limits;
 	int r = -EINVAL;
 
@@ -2381,8 +2386,9 @@ int dm_swap_table(struct mapped_device *
 		goto out;
 	}
 
-	__unbind(md);
+	map = __unbind(md);
 	r = __bind(md, table, &limits);
+	dm_table_destroy(map);
 
 out:
 	mutex_unlock(&md->suspend_lock);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* dm: keep old table until after resume succeeded
  2009-11-11  1:20 ` Alasdair G Kergon
@ 2009-11-11  2:48   ` Alasdair G Kergon
  0 siblings, 0 replies; 13+ messages in thread
From: Alasdair G Kergon @ 2009-11-11  2:48 UTC (permalink / raw)
  To: dm-devel
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

Follow-up patch which moves the table destruction after the resume.
__unbind() becomes redundant when followed by __bind().

Alasdair


From: Alasdair G Kergon <agk@redhat.com>

When swapping a new table into place, retain the old table until 
its replacement is in place.

An old check for an empty table is removed because this is enforced
in populate_table().

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-ioctl.c         |   10 ++++++----
 drivers/md/dm.c               |   34 +++++++++++++++++-----------------
 include/linux/device-mapper.h |    4 +++-
 3 files changed, 26 insertions(+), 22 deletions(-)

Index: linux-2.6.32-rc6/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.32-rc6.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.32-rc6/drivers/md/dm-ioctl.c
@@ -855,7 +855,7 @@ static int do_resume(struct dm_ioctl *pa
 	unsigned suspend_flags = DM_SUSPEND_LOCKFS_FLAG;
 	struct hash_cell *hc;
 	struct mapped_device *md;
-	struct dm_table *new_map;
+	struct dm_table *new_map, *old_map = NULL;
 
 	down_write(&_hash_lock);
 
@@ -884,11 +884,11 @@ static int do_resume(struct dm_ioctl *pa
 		if (!dm_suspended(md))
 			dm_suspend(md, suspend_flags);
 
-		r = dm_swap_table(md, new_map);
-		if (r) {
+		old_map = dm_swap_table(md, new_map);
+		if (IS_ERR(old_map)) {
 			dm_table_destroy(new_map);
 			dm_put(md);
-			return r;
+			return PTR_ERR(old_map);
 		}
 
 		if (dm_table_get_mode(new_map) & FMODE_WRITE)
@@ -900,6 +900,8 @@ static int do_resume(struct dm_ioctl *pa
 	if (dm_suspended(md))
 		r = dm_resume(md);
 
+	if (old_map)
+		dm_table_destroy(old_map);
 
 	if (!r) {
 		dm_kobject_uevent(md, KOBJ_CHANGE, param->event_nr);
Index: linux-2.6.32-rc6/drivers/md/dm.c
===================================================================
--- linux-2.6.32-rc6.orig/drivers/md/dm.c
+++ linux-2.6.32-rc6/drivers/md/dm.c
@@ -2026,9 +2026,13 @@ static void __set_size(struct mapped_dev
 	mutex_unlock(&md->bdev->bd_inode->i_mutex);
 }
 
-static int __bind(struct mapped_device *md, struct dm_table *t,
-		  struct queue_limits *limits)
+/*
+ * Returns old map, which caller must destroy.
+ */
+static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
+			       struct queue_limits *limits)
 {
+	struct dm_table *old_map;
 	struct request_queue *q = md->queue;
 	sector_t size;
 	unsigned long flags;
@@ -2043,11 +2047,6 @@ static int __bind(struct mapped_device *
 
 	__set_size(md, size);
 
-	if (!size) {
-		dm_table_destroy(t);
-		return 0;
-	}
-
 	dm_table_event_callback(t, event_callback, md);
 
 	/*
@@ -2063,11 +2062,12 @@ static int __bind(struct mapped_device *
 	__bind_mempools(md, t);
 
 	write_lock_irqsave(&md->map_lock, flags);
+	old_map = md->map;
 	md->map = t;
 	dm_table_set_restrictions(t, q, limits);
 	write_unlock_irqrestore(&md->map_lock, flags);
 
-	return 0;
+	return old_map;
 }
 
 /*
@@ -2361,13 +2361,13 @@ static void dm_rq_barrier_work(struct wo
 }
 
 /*
- * Swap in a new table (destroying old one).
+ * Swap in a new table, returning the old one for the caller to destroy.
  */
-int dm_swap_table(struct mapped_device *md, struct dm_table *table)
+struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
 {
-	struct dm_table *map;
+	struct dm_table *map = ERR_PTR(-EINVAL);
 	struct queue_limits limits;
-	int r = -EINVAL;
+	int r;
 
 	mutex_lock(&md->suspend_lock);
 
@@ -2376,8 +2376,10 @@ int dm_swap_table(struct mapped_device *
 		goto out;
 
 	r = dm_calculate_queue_limits(table, &limits);
-	if (r)
+	if (r) {
+		map = ERR_PTR(r);
 		goto out;
+	}
 
 	/* cannot change the device type, once a table is bound */
 	if (md->map &&
@@ -2386,13 +2388,11 @@ int dm_swap_table(struct mapped_device *
 		goto out;
 	}
 
-	map = __unbind(md);
-	r = __bind(md, table, &limits);
-	dm_table_destroy(map);
+	map = __bind(md, table, &limits);
 
 out:
 	mutex_unlock(&md->suspend_lock);
-	return r;
+	return map;
 }
 
 /*
Index: linux-2.6.32-rc6/include/linux/device-mapper.h
===================================================================
--- linux-2.6.32-rc6.orig/include/linux/device-mapper.h
+++ linux-2.6.32-rc6/include/linux/device-mapper.h
@@ -295,8 +295,10 @@ void dm_table_event(struct dm_table *t);
 
 /*
  * The device must be suspended before calling this method.
+ * Returns the previous table, which the caller must destroy.
  */
-int dm_swap_table(struct mapped_device *md, struct dm_table *t);
+struct dm_table *dm_swap_table(struct mapped_device *md,
+			       struct dm_table *t);
 
 /*
  * A wrapper around vmalloc.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11  1:16 dm: bind new table before destroying old Alasdair G Kergon
  2009-11-11  1:20 ` Alasdair G Kergon
@ 2009-11-11  6:56 ` Kiyoshi Ueda
  2009-11-11 15:11   ` Alasdair G Kergon
  2009-11-11 13:20 ` Mike Snitzer
  2 siblings, 1 reply; 13+ messages in thread
From: Kiyoshi Ueda @ 2009-11-11  6:56 UTC (permalink / raw)
  To: Alasdair Kergon
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, dm-devel,
	Mikulas Patocka, Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

Hi Alasdair,

On 11/11/2009 10:16 AM +0900, Alasdair G Kergon wrote:
> Questions:
> 
>   Do all the targets correctly flush or push back everything during a
>   suspend (including workqueues)?
> 
>   Do all the targets correctly sync to disk all internal state that
>   needs to be preserved during a suspend?
> 
> In other words, in the case of an already-suspended target, the target
> 'dtr' functions should only be freeing memory and other resources and
> not causing I/O to any of the table's devices.
> 
> All targets are supposed to be behave this way already, but please
> would you check the targets with which you are familiar anyway?

I have checked multipath and found 2 issues.

multipath flushes all normal I/Os before suspend completion.
But multipath doesn't flush some workqueues until the table destruction.
Also, such works can be added and kicked even after suspend completion
through message ioctl.

For example, [de]activate_path() and trigger_event() are such works.
"reinstate path" message will trigger activate_path() work and
activate_path() may send some SCSI commands (through pg_init()) to
the underlying devices of the already-suspended target.
(Also, "fail_path" message will trigger deactivate_path() work
 and deactivate_path() may abort the underlying device's queue
 of the already-suspended target.)

So moving the table destruction after the resume (in your another
patch) could/might cause some race problems between new_table and
old_table if they have a same underlying device.
(e.g. pg_init() race, aborting queue after resume.)

I believe dm-mpath needs to flush such workqueues in postsuspend.
Also, we need something to block message ioctl to suspended device.

As for the message ioctl, I don't have any good idea, but...
  - Reject message ioctl to suspended device in dm-ioctl
  Or/And 
  - Targets must not kick any work influencing external themselves
    in their message ioctl handlers.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11  6:56 ` dm: bind new table before destroying old Kiyoshi Ueda
@ 2009-11-11 15:11   ` Alasdair G Kergon
  2009-11-11 15:14     ` Milan Broz
  2009-11-12  9:59     ` Kiyoshi Ueda
  0 siblings, 2 replies; 13+ messages in thread
From: Alasdair G Kergon @ 2009-11-11 15:11 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, Heinz Mauelshagen, dm-devel, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

On Wed, Nov 11, 2009 at 03:56:05PM +0900, Kiyoshi Ueda wrote:
> I believe dm-mpath needs to flush such workqueues in postsuspend.

Yes - should be an easy change.

> Also, we need something to block message ioctl to suspended device.
 
> As for the message ioctl, I don't have any good idea, but...
>   - Reject message ioctl to suspended device in dm-ioctl

I think the crypt target expects to be able to do this to manipulate
the in-core encryption key.

>   - Targets must not kick any work influencing external themselves
>     in their message ioctl handlers.

So I think we'll need a target-specific approach like that here.

Alasdair

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11 15:11   ` Alasdair G Kergon
@ 2009-11-11 15:14     ` Milan Broz
  2009-11-12  7:00       ` Mike Anderson
  2009-11-12  9:59     ` Kiyoshi Ueda
  1 sibling, 1 reply; 13+ messages in thread
From: Milan Broz @ 2009-11-11 15:14 UTC (permalink / raw)
  To: Kiyoshi Ueda, dm-devel, Mike Snitzer, Mikulas Patocka,
	Jonathan Brassow, Jun'ichi 

On 11/11/2009 04:11 PM, Alasdair G Kergon wrote:
> On Wed, Nov 11, 2009 at 03:56:05PM +0900, Kiyoshi Ueda wrote:
>> I believe dm-mpath needs to flush such workqueues in postsuspend.
> 
> Yes - should be an easy change.

similar problem is probably in crypt target, I'll check it later.

>> Also, we need something to block message ioctl to suspended device.
>  
>> As for the message ioctl, I don't have any good idea, but...
>>   - Reject message ioctl to suspended device in dm-ioctl
> 
> I think the crypt target expects to be able to do this to manipulate
> the in-core encryption key.

yes, crypt target must be able to process messages when in suspended state.

Milan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: dm: bind new table before destroying old
  2009-11-11 15:14     ` Milan Broz
@ 2009-11-12  7:00       ` Mike Anderson
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Anderson @ 2009-11-12  7:00 UTC (permalink / raw)
  To: device-mapper development
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura

Milan Broz <mbroz@redhat.com> wrote:
> On 11/11/2009 04:11 PM, Alasdair G Kergon wrote:
> > On Wed, Nov 11, 2009 at 03:56:05PM +0900, Kiyoshi Ueda wrote:
> >> I believe dm-mpath needs to flush such workqueues in postsuspend.
> > 
> > Yes - should be an easy change.
> 
> similar problem is probably in crypt target, I'll check it later.
> 
> >> Also, we need something to block message ioctl to suspended device.
> >  
> >> As for the message ioctl, I don't have any good idea, but...
> >>   - Reject message ioctl to suspended device in dm-ioctl
> > 
> > I think the crypt target expects to be able to do this to manipulate
> > the in-core encryption key.
> 
> yes, crypt target must be able to process messages when in suspended state.
> 

I hit a issue like Kiyoshi described of the target_message
(multipath_message) generating new work while dev_remove is trying
complete in parallel with calling fail_path / reinstate_path.

I added two accessor functions to dm-table.c (dm_table_md_suspended and
dm_table_md_freeing). I am calling both functions from multipath_message
to return early if we are in one of these states. Depending on the usage
model of the other targets message functions dm_table_md_freeing (or a new
function dm_table_md_deleting) could be called in target_message instead
of each targets message function.

The test runs using a mutex and the alternate option of using spin_lock
are currently passing I will post tomorrow the series for comments.

I also tested suspend / resume in parallel with calling fail_path /
reinstate_path. 

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11 15:11   ` Alasdair G Kergon
  2009-11-11 15:14     ` Milan Broz
@ 2009-11-12  9:59     ` Kiyoshi Ueda
  1 sibling, 0 replies; 13+ messages in thread
From: Kiyoshi Ueda @ 2009-11-12  9:59 UTC (permalink / raw)
  To: Alasdair Kergon
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, dm-devel,
	Mikulas Patocka, Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

Hi Alasdair,

On 11/12/2009 12:11 AM +0900, Alasdair G Kergon wrote:
> On Wed, Nov 11, 2009 at 03:56:05PM +0900, Kiyoshi Ueda wrote:
>> I believe dm-mpath needs to flush such workqueues in postsuspend.
> 
> Yes - should be an easy change.

OK, I posted the patch.
See: http://patchwork.kernel.org/patch/59556/


>> Also, we need something to block message ioctl to suspended device.
>  
>> As for the message ioctl, I don't have any good idea, but...
>>   - Reject message ioctl to suspended device in dm-ioctl
> 
> I think the crypt target expects to be able to do this to manipulate
> the in-core encryption key.
> 
>>   - Targets must not kick any work influencing external themselves
>>     in their message ioctl handlers.
> 
> So I think we'll need a target-specific approach like that here.

OK, Mike Anderson seems to be working on this.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11  1:16 dm: bind new table before destroying old Alasdair G Kergon
  2009-11-11  1:20 ` Alasdair G Kergon
  2009-11-11  6:56 ` dm: bind new table before destroying old Kiyoshi Ueda
@ 2009-11-11 13:20 ` Mike Snitzer
  2009-11-11 23:45   ` Mike Snitzer
  2 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2009-11-11 13:20 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Kiyoshi Ueda, Heinz Mauelshagen, dm-devel, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

On Tue, Nov 10 2009 at  8:16pm -0500,
Alasdair G Kergon <agk@redhat.com> wrote:

> Questions:
> 
>   Do all the targets correctly flush or push back everything during a
>   suspend (including workqueues)?
> 
>   Do all the targets correctly sync to disk all internal state that
>   needs to be preserved during a suspend?
> 
> In other words, in the case of an already-suspended target, the target
> 'dtr' functions should only be freeing memory and other resources and
> not causing I/O to any of the table's devices.
> 
> All targets are supposed to be behave this way already, but please
> would you check the targets with which you are familiar anyway?
> 
> Alasdair
> 
> 
> From: Alasdair G Kergon <agk@redhat.com>
> 
> When replacing a mapped device's table during a 'resume', delay the
> destruction of the old table until the new one is successfully in place.
> 
> This will make it easier for a later patch to transfer internal state
> information from the old table to the new one (something we do not currently
> support) while giving us more options for reversion if a later part
> of the operation fails.

I have confirmed that this patch allows handover to work within a single
device.

The error recovery, will it be done unilaterally in the DM core (on
preresume failure)?  Or will each target driver need to call a DM core
callback if it'd like recovery?

Mike

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11 13:20 ` Mike Snitzer
@ 2009-11-11 23:45   ` Mike Snitzer
  2009-11-19  0:50     ` Alasdair G Kergon
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2009-11-11 23:45 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Kiyoshi Ueda, Heinz Mauelshagen, dm-devel, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

On Wed, Nov 11 2009 at  8:20am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Nov 10 2009 at  8:16pm -0500,
> Alasdair G Kergon <agk@redhat.com> wrote:
> 
> > Questions:
> > 
> >   Do all the targets correctly flush or push back everything during a
> >   suspend (including workqueues)?
> > 
> >   Do all the targets correctly sync to disk all internal state that
> >   needs to be preserved during a suspend?
> > 
> > In other words, in the case of an already-suspended target, the target
> > 'dtr' functions should only be freeing memory and other resources and
> > not causing I/O to any of the table's devices.
> > 
> > All targets are supposed to be behave this way already, but please
> > would you check the targets with which you are familiar anyway?
> > 
> > Alasdair
> > 
> > 
> > From: Alasdair G Kergon <agk@redhat.com>
> > 
> > When replacing a mapped device's table during a 'resume', delay the
> > destruction of the old table until the new one is successfully in place.
> > 
> > This will make it easier for a later patch to transfer internal state
> > information from the old table to the new one (something we do not currently
> > support) while giving us more options for reversion if a later part
> > of the operation fails.
> 
> I have confirmed that this patch allows handover to work within a single
> device.

Alasdair,

After further testing I've hit a lockdep trace.  My testing was with
handing over on the same device.  I had the snapshot (of an ext3 FS)
mounted and I was doing a sequential direct-io write to a file in the
FS.  While writing I triggered a handover with the following:

echo "0 50331648 snapshot 253:2 253:3 P 8" | dmsetup reload test-testlv_snap
dmsetup resume test-testlv_snap

With that handover worked fine (with no IO errors), but the following
lockdep resulted (some "snapshot_*" tracing was added for context):

snapshot_ctr
snapshot_ctr: found snap_src
snapshot_presuspend

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc6-snitm #8
-------------------------------------------------------
dmsetup/1827 is trying to acquire lock:
 (&md->suspend_lock){+.+...}, at: [<ffffffffa00678d8>] dm_swap_table+0x2d/0x249 [dm_mod]

but task is already holding lock:
 (&journal->j_barrier){+.+...}, at: [<ffffffff8119192d>] journal_lock_updates+0xe1/0xf0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&journal->j_barrier){+.+...}:
       [<ffffffff810857b3>] __lock_acquire+0xb6b/0xd13
       [<ffffffff81086396>] lock_release_non_nested+0x1dc/0x23b
       [<ffffffff8108656f>] lock_release+0x17a/0x1a5
       [<ffffffff8139214b>] __mutex_unlock_slowpath+0xce/0x132
       [<ffffffff813921bd>] mutex_unlock+0xe/0x10
       [<ffffffff81147329>] freeze_bdev+0x104/0x110
       [<ffffffffa0069038>] dm_suspend+0x119/0x2a1 [dm_mod]
       [<ffffffffa006db3a>] dev_suspend+0x11d/0x1de [dm_mod]
       [<ffffffffa006e30c>] ctl_ioctl+0x1c6/0x213 [dm_mod]
       [<ffffffffa006e36c>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
       [<ffffffff8112a959>] vfs_ioctl+0x22/0x87
       [<ffffffff8112aec2>] do_vfs_ioctl+0x488/0x4ce
       [<ffffffff8112af5e>] sys_ioctl+0x56/0x79
       [<ffffffff8100bb82>] system_call_fastpath+0x16/0x1b

-> #0 (&md->suspend_lock){+.+...}:
       [<ffffffff8108565d>] __lock_acquire+0xa15/0xd13
       [<ffffffff81085a37>] lock_acquire+0xdc/0x102
       [<ffffffff81392372>] __mutex_lock_common+0x4b/0x37b
       [<ffffffff81392766>] mutex_lock_nested+0x3e/0x43
       [<ffffffffa00678d8>] dm_swap_table+0x2d/0x249 [dm_mod]
       [<ffffffffa006db45>] dev_suspend+0x128/0x1de [dm_mod]
       [<ffffffffa006e30c>] ctl_ioctl+0x1c6/0x213 [dm_mod]
       [<ffffffffa006e36c>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
       [<ffffffff8112a959>] vfs_ioctl+0x22/0x87
       [<ffffffff8112aec2>] do_vfs_ioctl+0x488/0x4ce
       [<ffffffff8112af5e>] sys_ioctl+0x56/0x79
       [<ffffffff8100bb82>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by dmsetup/1827:
 #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff8119192d>] journal_lock_updates+0xe1/0xf0

stack backtrace:
Pid: 1827, comm: dmsetup Not tainted 2.6.32-rc6-snitm #8
Call Trace:
 [<ffffffff81084825>] print_circular_bug+0xa8/0xb7
 [<ffffffff8108565d>] __lock_acquire+0xa15/0xd13
 [<ffffffff81085a37>] lock_acquire+0xdc/0x102
 [<ffffffffa00678d8>] ? dm_swap_table+0x2d/0x249 [dm_mod]
 [<ffffffffa00678d8>] ? dm_swap_table+0x2d/0x249 [dm_mod]
 [<ffffffffa006da1d>] ? dev_suspend+0x0/0x1de [dm_mod]
 [<ffffffff81392372>] __mutex_lock_common+0x4b/0x37b
 [<ffffffffa00678d8>] ? dm_swap_table+0x2d/0x249 [dm_mod]
 [<ffffffff81083933>] ? mark_lock+0x2d/0x22d
 [<ffffffff81083b85>] ? mark_held_locks+0x52/0x70
 [<ffffffff8139219d>] ? __mutex_unlock_slowpath+0x120/0x132
 [<ffffffffa006da1d>] ? dev_suspend+0x0/0x1de [dm_mod]
 [<ffffffff81392766>] mutex_lock_nested+0x3e/0x43
 [<ffffffffa00678d8>] dm_swap_table+0x2d/0x249 [dm_mod]
 [<ffffffff813921bd>] ? mutex_unlock+0xe/0x10
 [<ffffffffa00691ae>] ? dm_suspend+0x28f/0x2a1 [dm_mod]
 [<ffffffffa006da1d>] ? dev_suspend+0x0/0x1de [dm_mod]
 [<ffffffffa006db45>] dev_suspend+0x128/0x1de [dm_mod]
 [<ffffffffa006e30c>] ctl_ioctl+0x1c6/0x213 [dm_mod]
 [<ffffffff81077d7f>] ? cpu_clock+0x43/0x5e
 [<ffffffffa006e36c>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
 [<ffffffff8112a959>] vfs_ioctl+0x22/0x87
 [<ffffffff81083e41>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8112aec2>] do_vfs_ioctl+0x488/0x4ce
 [<ffffffff811f3e5a>] ? __up_read+0x76/0x7f
 [<ffffffff81076746>] ? up_read+0x2b/0x2f
 [<ffffffff8100c635>] ? retint_swapgs+0x13/0x1b
 [<ffffffff8112af5e>] sys_ioctl+0x56/0x79
 [<ffffffff8100bb82>] system_call_fastpath+0x16/0x1b
snapshot_preresume
snapshot_preresume: snap_src is_handover_source
snapshot_preresume: resuming handover-destination
snapshot_resume
snapshot_resume: handing over exceptions
snapshot_dtr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm: bind new table before destroying old
  2009-11-11 23:45   ` Mike Snitzer
@ 2009-11-19  0:50     ` Alasdair G Kergon
  2009-11-19  2:51       ` malahal
  0 siblings, 1 reply; 13+ messages in thread
From: Alasdair G Kergon @ 2009-11-19  0:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kiyoshi Ueda, Heinz Mauelshagen, dm-devel, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

On Wed, Nov 11, 2009 at 06:45:55PM -0500, Mike Snitzer wrote:
> After further testing I've hit a lockdep trace.  My testing was with
> handing over on the same device.  I had the snapshot (of an ext3 FS)
> mounted and I was doing a sequential direct-io write to a file in the
> FS.  While writing I triggered a handover with the following:
 
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.32-rc6-snitm #8
> -------------------------------------------------------
> dmsetup/1827 is trying to acquire lock:
>  (&md->suspend_lock){+.+...}, at: [<ffffffffa00678d8>] dm_swap_table+0x2d/0x249 [dm_mod]
> 
> but task is already holding lock:
>  (&journal->j_barrier){+.+...}, at: [<ffffffff8119192d>] journal_lock_updates+0xe1/0xf0
> 
> which lock already depends on the new lock.

I'm going to assume this is bogus - and I can't spot any annotations
available to suppress it, so people will just have to ignore it.

Suspend involves:
  get suspend lock
  if dev is not already suspended
    get journal lock
    set state "dev is suspended"
  release suspend lock
  
Resume involves
  [journal lock is held]
  get suspend lock
  if dev is suspended 
    release journal lock
    set state "dev is not suspended"
  release suspend lock

It looks as if lockdep sees that as a problem:
  Imagine those two sections running in parallel without the "Is dev
  suspended?" check, of which lockdep has no knowledge.

Alasdair

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: dm: bind new table before destroying old
  2009-11-19  0:50     ` Alasdair G Kergon
@ 2009-11-19  2:51       ` malahal
  2009-11-19 12:49         ` Mikulas Patocka
  0 siblings, 1 reply; 13+ messages in thread
From: malahal @ 2009-11-19  2:51 UTC (permalink / raw)
  To: dm-devel
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, Mikulas Patocka,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

Alasdair G Kergon [agk@redhat.com] wrote:
> > After further testing I've hit a lockdep trace.  My testing was with
> > handing over on the same device.  I had the snapshot (of an ext3 FS)
> > mounted and I was doing a sequential direct-io write to a file in the
> > FS.  While writing I triggered a handover with the following:
> 
> > =======================================================
> > [ INFO: possible circular locking dependency detected ]
> > 2.6.32-rc6-snitm #8
> > -------------------------------------------------------
> > dmsetup/1827 is trying to acquire lock:
> >  (&md->suspend_lock){+.+...}, at: [<ffffffffa00678d8>] dm_swap_table+0x2d/0x249 [dm_mod]
> > 
> > but task is already holding lock:
> >  (&journal->j_barrier){+.+...}, at: [<ffffffff8119192d>] journal_lock_updates+0xe1/0xf0
> > 
> > which lock already depends on the new lock.
> 
> I'm going to assume this is bogus - and I can't spot any annotations
> available to suppress it, so people will just have to ignore it.
> 
> Suspend involves:
>   get suspend lock
>   if dev is not already suspended
>     get journal lock
>     set state "dev is suspended"
>   release suspend lock
>   
> Resume involves
>   [journal lock is held]
>   get suspend lock
>   if dev is suspended 
>     release journal lock
>     set state "dev is not suspended"
>   release suspend lock
> 
> It looks as if lockdep sees that as a problem:
>   Imagine those two sections running in parallel without the "Is dev
>   suspended?" check, of which lockdep has no knowledge.

Agreed, but is it possible to restructure the suspend code such that it
acquires the journal lock before the suspend lock, and then releases the
journal lock if dev is already suspended? This needs some explaining
in a comment form though! :-)

Thanks, Malahal.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Re: dm: bind new table before destroying old
  2009-11-19  2:51       ` malahal
@ 2009-11-19 12:49         ` Mikulas Patocka
  0 siblings, 0 replies; 13+ messages in thread
From: Mikulas Patocka @ 2009-11-19 12:49 UTC (permalink / raw)
  To: linux-ext4, Stephen Tweedie
  Cc: Kiyoshi Ueda, Mike Snitzer, Heinz Mauelshagen, dm-devel,
	Zdenek Kabelac, Jun'ichi Nomura, Milan Broz

On Wed, 18 Nov 2009, malahal@us.ibm.com wrote:

> Alasdair G Kergon [agk@redhat.com] wrote:
> > > After further testing I've hit a lockdep trace.  My testing was with
> > > handing over on the same device.  I had the snapshot (of an ext3 FS)
> > > mounted and I was doing a sequential direct-io write to a file in the
> > > FS.  While writing I triggered a handover with the following:
> > 
> > > =======================================================
> > > [ INFO: possible circular locking dependency detected ]
> > > 2.6.32-rc6-snitm #8
> > > -------------------------------------------------------
> > > dmsetup/1827 is trying to acquire lock:
> > >  (&md->suspend_lock){+.+...}, at: [<ffffffffa00678d8>] dm_swap_table+0x2d/0x249 [dm_mod]
> > > 
> > > but task is already holding lock:
> > >  (&journal->j_barrier){+.+...}, at: [<ffffffff8119192d>] journal_lock_updates+0xe1/0xf0
> > > 
> > > which lock already depends on the new lock.
> > 
> > I'm going to assume this is bogus - and I can't spot any annotations
> > available to suppress it, so people will just have to ignore it.
> > 
> > Suspend involves:
> >   get suspend lock
> >   if dev is not already suspended
> >     get journal lock
> >     set state "dev is suspended"
> >   release suspend lock
> >   
> > Resume involves
> >   [journal lock is held]
> >   get suspend lock
> >   if dev is suspended 
> >     release journal lock
> >     set state "dev is not suspended"
> >   release suspend lock
> > 
> > It looks as if lockdep sees that as a problem:
> >   Imagine those two sections running in parallel without the "Is dev
> >   suspended?" check, of which lockdep has no knowledge.
> 
> Agreed, but is it possible to restructure the suspend code such that it
> acquires the journal lock before the suspend lock, and then releases the
> journal lock if dev is already suspended? This needs some explaining
> in a comment form though! :-)
> 
> Thanks, Malahal.

The real reason for the warning is that ext3 jbd takes the mutex on 
suspend (ext3_freeze) and then keeps it taken until resume 
(ext3_unfreeze).

You also get a warning if you issue "dmsetup suspend" manually, it warns 
that a task exists with mutex held.

If suspending and resuming manually, suspend and resume are done from 
different processes, thus ext3 is violating mutex specification
Documentation/mutex-design.txt
* - only the owner can unlock the mutex
* - task may not exit with mutex held

It is really a bug in ext3 and ext4 and device mapper has nothing to do 
with it, except that it triggers it. The bug should be reported to ext[34] 
maintainers and fixed there.

Mikulas

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-11-19 12:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-11  1:16 dm: bind new table before destroying old Alasdair G Kergon
2009-11-11  1:20 ` Alasdair G Kergon
2009-11-11  2:48   ` dm: keep old table until after resume succeeded Alasdair G Kergon
2009-11-11  6:56 ` dm: bind new table before destroying old Kiyoshi Ueda
2009-11-11 15:11   ` Alasdair G Kergon
2009-11-11 15:14     ` Milan Broz
2009-11-12  7:00       ` Mike Anderson
2009-11-12  9:59     ` Kiyoshi Ueda
2009-11-11 13:20 ` Mike Snitzer
2009-11-11 23:45   ` Mike Snitzer
2009-11-19  0:50     ` Alasdair G Kergon
2009-11-19  2:51       ` malahal
2009-11-19 12:49         ` Mikulas Patocka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.