All of lore.kernel.org
 help / color / mirror / Atom feed
* dm-snap deadlock in pending_complete()
@ 2015-08-10  3:55 NeilBrown
  2015-08-11  9:14 ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2015-08-10  3:55 UTC (permalink / raw)
  To: Mikulas Patocka, dm-devel


Hi Mikulas,
 I have a customer hitting the deadlock you described over a year ago
 in:

Subject: [dm-devel] [PATCH] block: flush queued bios when the process
         blocks

I notice that patch never went upstream.

Has anything else been done to fix this deadlock?

My thought was that something like the below would be sufficient.  Do
you see any problem with that?  It avoids the deadlock by dropping the
lock while sleeping.

Thanks
NeilBrown

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 7c82d3ccce87..d29bcd02f9cf 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1454,6 +1454,7 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
 	}
 	*e = pe->e;
 
+retry:
 	down_write(&s->lock);
 	if (!s->valid) {
 		free_completed_exception(e);
@@ -1462,7 +1463,11 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
 	}
 
 	/* Check for conflicting reads */
-	__check_for_conflicting_io(s, pe->e.old_chunk);
+	if (__chunk_size_tracked(s, pe->e.old_chunk)) {
+		up_write(&s->lock);
+		msleep(1);
+		goto retry;
+	}
 
 	/*
 	 * Add a proper exception, and remove the

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-10  3:55 dm-snap deadlock in pending_complete() NeilBrown
@ 2015-08-11  9:14 ` Mikulas Patocka
  2015-08-12  1:46   ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2015-08-11  9:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: dm-devel

Hi

On Mon, 10 Aug 2015, NeilBrown wrote:

> 
> Hi Mikulas,
>  I have a customer hitting the deadlock you described over a year ago
>  in:
> 
> Subject: [dm-devel] [PATCH] block: flush queued bios when the process
>          blocks

Ask block layer maintainers to accept that patch.

> I notice that patch never went upstream.
> 
> Has anything else been done to fix this deadlock?
> 
> My thought was that something like the below would be sufficient.  Do
> you see any problem with that?  It avoids the deadlock by dropping the
> lock while sleeping.

The patch below introduces a bug - if the user is continuously reading the 
same chunk (for example, by issuing multiple direct i/o requests to the 
same chunk), the function pending_complete never finishes.

We need to hold the lock while waiting to make sure that no new read 
requests are submitted.

Mikulas

> Thanks
> NeilBrown
> 
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 7c82d3ccce87..d29bcd02f9cf 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1454,6 +1454,7 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
>  	}
>  	*e = pe->e;
>  
> +retry:
>  	down_write(&s->lock);
>  	if (!s->valid) {
>  		free_completed_exception(e);
> @@ -1462,7 +1463,11 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
>  	}
>  
>  	/* Check for conflicting reads */
> -	__check_for_conflicting_io(s, pe->e.old_chunk);
> +	if (__chunk_size_tracked(s, pe->e.old_chunk)) {
> +		up_write(&s->lock);
> +		msleep(1);
> +		goto retry;
> +	}
>  
>  	/*
>  	 * Add a proper exception, and remove the
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-11  9:14 ` Mikulas Patocka
@ 2015-08-12  1:46   ` NeilBrown
  2015-08-12 14:16     ` Mikulas Patocka
  2015-08-12 16:25     ` Mikulas Patocka
  0 siblings, 2 replies; 11+ messages in thread
From: NeilBrown @ 2015-08-12  1:46 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel

On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
<mpatocka@redhat.com> wrote:

> Hi
> 
> On Mon, 10 Aug 2015, NeilBrown wrote:
> 
> > 
> > Hi Mikulas,
> >  I have a customer hitting the deadlock you described over a year ago
> >  in:
> > 
> > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> >          blocks
> 
> Ask block layer maintainers to accept that patch.

Unfortunately I don't really like the patch ... or the bioset rescue
workqueues that it is based on.   Sorry.

So I might keep looking for a more localised approach....

> 
> > I notice that patch never went upstream.
> > 
> > Has anything else been done to fix this deadlock?
> > 
> > My thought was that something like the below would be sufficient.  Do
> > you see any problem with that?  It avoids the deadlock by dropping the
> > lock while sleeping.
> 
> The patch below introduces a bug - if the user is continuously reading the 
> same chunk (for example, by issuing multiple direct i/o requests to the 
> same chunk), the function pending_complete never finishes.
> 
> We need to hold the lock while waiting to make sure that no new read 
> requests are submitted.

Ahh - thanks for the explanation.  That makes sense.
Presumably you need to wait for old reads that you don't have a read
returning old data after a write has confirmed the new data is written?
Is there some other reason?

As soon as the new "proper" exception is installed, no new reads will
use the old address, so it should be safe to wait without holding the
lock.

So what about this?

Thanks for your time,
NeilBrown



diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 7c82d3ccce87..c7a7ed2cfd6d 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1461,14 +1461,22 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
 		goto out;
 	}
 
-	/* Check for conflicting reads */
-	__check_for_conflicting_io(s, pe->e.old_chunk);
+	/* Add proper exception.  Now all reads will be
+	 * redirected, so no new reads can be started on
+	 * 'old_chunk'.
+	 */
+	dm_insert_exception(&s->complete, e);
+
+	/* Drain conflicting reads */
+	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
+		up_write(&s->lock);
+		__check_for_conflicting_io(s, pe->e.old_chunk);
+		down_write(&s->lock);
+	}
 
 	/*
-	 * Add a proper exception, and remove the
-	 * in-flight exception from the list.
+	 * And now we can remove the temporary exception.
 	 */
-	dm_insert_exception(&s->complete, e);
 
 out:
 	dm_remove_exception(&pe->e);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-12  1:46   ` NeilBrown
@ 2015-08-12 14:16     ` Mikulas Patocka
  2015-08-13  1:23       ` NeilBrown
  2015-08-12 16:25     ` Mikulas Patocka
  1 sibling, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2015-08-12 14:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: dm-devel



On Wed, 12 Aug 2015, NeilBrown wrote:

> Ahh - thanks for the explanation.  That makes sense.
> Presumably you need to wait for old reads that you don't have a read
> returning old data after a write has confirmed the new data is written?
> Is there some other reason?
> 
> As soon as the new "proper" exception is installed, no new reads will
> use the old address, so it should be safe to wait without holding the
> lock.
> 
> So what about this?

This is wrong too - when you add the exception to the complete hash table 
and drop the lock, writes to the chunk in the origin target are can modify 
the origin volume. These writes race with pending reads to the snapshot 
(__chunk_is_tracked tests for those reads). If the I/O scheduler schedules 
the write to the origin before the read from the snapshot, the read from 
the snapshot returns corrupted data.

There can be more problems with snapshot merging - if the chunk is on the 
complete hash table, the merging code assumes that it can be safely 
merged.

Mikulas

> Thanks for your time,
> NeilBrown
> 
> 
> 
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 7c82d3ccce87..c7a7ed2cfd6d 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1461,14 +1461,22 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
>  		goto out;
>  	}
>  
> -	/* Check for conflicting reads */
> -	__check_for_conflicting_io(s, pe->e.old_chunk);
> +	/* Add proper exception.  Now all reads will be
> +	 * redirected, so no new reads can be started on
> +	 * 'old_chunk'.
> +	 */
> +	dm_insert_exception(&s->complete, e);
> +
> +	/* Drain conflicting reads */
> +	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
> +		up_write(&s->lock);
> +		__check_for_conflicting_io(s, pe->e.old_chunk);
> +		down_write(&s->lock);
> +	}
>  
>  	/*
> -	 * Add a proper exception, and remove the
> -	 * in-flight exception from the list.
> +	 * And now we can remove the temporary exception.
>  	 */
> -	dm_insert_exception(&s->complete, e);
>  
>  out:
>  	dm_remove_exception(&pe->e);
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-12  1:46   ` NeilBrown
  2015-08-12 14:16     ` Mikulas Patocka
@ 2015-08-12 16:25     ` Mikulas Patocka
  2015-08-13  0:43       ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2015-08-12 16:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: dm-devel



On Wed, 12 Aug 2015, NeilBrown wrote:

> On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > On Mon, 10 Aug 2015, NeilBrown wrote:
> > 
> > > 
> > > Hi Mikulas,
> > >  I have a customer hitting the deadlock you described over a year ago
> > >  in:
> > > 
> > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > >          blocks
> > 
> > Ask block layer maintainers to accept that patch.
> 
> Unfortunately I don't really like the patch ... or the bioset rescue
> workqueues that it is based on.   Sorry.
> 
> So I might keep looking for a more localised approach....

The problem here is that other dm targets may deadlock in a similar way 
too - for example, dm-thin could deadlock on pmd->pool_lock.

The cause of the bug is bio queuing on current->bio_list. There is an 
assumption that if a dm target submits a bio to a lower-level target, the 
bio finishes in finite time. Queuing on current->bio_list breaks the 
assumption, bios can be held indefinitelly on current->bio_list.

The patch that flushes current->bio_list is the correct way to fix it - it 
makes sure that a bio can't be held indefinitely.

Another way to fix it would be to abandon current->bio_list --- but then, 
there would be problem with stack overflow on deeply nested targets.

Mikulas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-12 16:25     ` Mikulas Patocka
@ 2015-08-13  0:43       ` NeilBrown
  2015-08-13 14:02         ` Mike Snitzer
  2015-08-17 13:48         ` Mikulas Patocka
  0 siblings, 2 replies; 11+ messages in thread
From: NeilBrown @ 2015-08-13  0:43 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel

On Wed, 12 Aug 2015 12:25:42 -0400 (EDT) Mikulas Patocka
<mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 12 Aug 2015, NeilBrown wrote:
> 
> > On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> > <mpatocka@redhat.com> wrote:
> > 
> > > Hi
> > > 
> > > On Mon, 10 Aug 2015, NeilBrown wrote:
> > > 
> > > > 
> > > > Hi Mikulas,
> > > >  I have a customer hitting the deadlock you described over a year ago
> > > >  in:
> > > > 
> > > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > > >          blocks
> > > 
> > > Ask block layer maintainers to accept that patch.
> > 
> > Unfortunately I don't really like the patch ... or the bioset rescue
> > workqueues that it is based on.   Sorry.
> > 
> > So I might keep looking for a more localised approach....
> 
> The problem here is that other dm targets may deadlock in a similar way 
> too - for example, dm-thin could deadlock on pmd->pool_lock.
> 
> The cause of the bug is bio queuing on current->bio_list. There is an 
> assumption that if a dm target submits a bio to a lower-level target, the 
> bio finishes in finite time. Queuing on current->bio_list breaks the 
> assumption, bios can be held indefinitelly on current->bio_list.
> 
> The patch that flushes current->bio_list is the correct way to fix it - it 
> makes sure that a bio can't be held indefinitely.
> 
> Another way to fix it would be to abandon current->bio_list --- but then, 
> there would be problem with stack overflow on deeply nested targets.
> 

I think it is a bit simplistic to say that current->bio_list is the
"cause" of the problem.  It is certainly a part of the problem.  The
assumption you mention is another part - and the two conflict.

As you say, we could abandon current->bio_list, but then we risk stack
overflows again.
Or we could hand the bio_list to a work-queue whenever the make_request
function needs to schedule.... but then if handling one of those bios
needs to schedule...  not sure, it might work.

Or we could change the assumption and never block in a make_request
function after calling generic_make_request().  Maybe that is difficult.

Or we could change __split_and_process_bio to use bio_split() to split
the bio, then handle the first and call generic_make_request on the
second.  That would effectively put the second half on the end of
bio_list so it wouldn't be tried until all requests derived from the
first half have been handled.

None of these is completely straight forward, but I suspect all are
possible.

I'm leaning towards the last one:  when you want to split a bio, use
bio_split and handle the two halves separately.

Do you have thoughts on that?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-12 14:16     ` Mikulas Patocka
@ 2015-08-13  1:23       ` NeilBrown
  2015-08-17 10:55         ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2015-08-13  1:23 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel

On Wed, 12 Aug 2015 10:16:17 -0400 (EDT) Mikulas Patocka
<mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 12 Aug 2015, NeilBrown wrote:
> 
> > Ahh - thanks for the explanation.  That makes sense.
> > Presumably you need to wait for old reads that you don't have a read
> > returning old data after a write has confirmed the new data is written?
> > Is there some other reason?
> > 
> > As soon as the new "proper" exception is installed, no new reads will
> > use the old address, so it should be safe to wait without holding the
> > lock.
> > 
> > So what about this?
> 
> This is wrong too - when you add the exception to the complete hash table 
> and drop the lock, writes to the chunk in the origin target are can modify 
> the origin volume. These writes race with pending reads to the snapshot 
> (__chunk_is_tracked tests for those reads). If the I/O scheduler schedules 
> the write to the origin before the read from the snapshot, the read from 
> the snapshot returns corrupted data.
> 
> There can be more problems with snapshot merging - if the chunk is on the 
> complete hash table, the merging code assumes that it can be safely 
> merged.
> 

Thanks again.

I came up with the following which should address the origin-write
issue, but I'm not so sure about snapshot merging, and it is becoming a
lot less simple that I had hoped.

I'll see if I can come up with code for a more general solution that is
still localised to dm - along the lines of my bio_split suggestion.

Thanks,
NeilBrown



diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 7c82d3ccce87..7f9e1b0429c8 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1461,14 +1461,22 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
 		goto out;
 	}
 
-	/* Check for conflicting reads */
-	__check_for_conflicting_io(s, pe->e.old_chunk);
+	/* Add proper exception.  Now all reads will be
+	 * redirected, so no new reads can be started on
+	 * 'old_chunk'.
+	 */
+	dm_insert_exception(&s->complete, e);
+
+	/* Drain conflicting reads */
+	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
+		up_write(&s->lock);
+		__check_for_conflicting_io(s, pe->e.old_chunk);
+		down_write(&s->lock);
+	}
 
 	/*
-	 * Add a proper exception, and remove the
-	 * in-flight exception from the list.
+	 * And now we can remove the temporary exception.
 	 */
-	dm_insert_exception(&s->complete, e);
 
 out:
 	dm_remove_exception(&pe->e);
@@ -2089,17 +2097,17 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
 		 */
 		chunk = sector_to_chunk(snap->store, sector);
 
-		/*
-		 * Check exception table to see if block
-		 * is already remapped in this snapshot
-		 * and trigger an exception if not.
-		 */
-		e = dm_lookup_exception(&snap->complete, chunk);
-		if (e)
-			goto next_snapshot;
-
 		pe = __lookup_pending_exception(snap, chunk);
 		if (!pe) {
+			/*
+			 * Check exception table to see if block
+			 * is already remapped in this snapshot
+			 * and trigger an exception if not.
+			 */
+			e = dm_lookup_exception(&snap->complete, chunk);
+			if (e)
+				goto next_snapshot;
+
 			up_write(&snap->lock);
 			pe = alloc_pending_exception(snap);
 			down_write(&snap->lock);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-13  0:43       ` NeilBrown
@ 2015-08-13 14:02         ` Mike Snitzer
  2015-08-17 13:48         ` Mikulas Patocka
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2015-08-13 14:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: dm-devel, Mikulas Patocka

On Wed, Aug 12 2015 at  8:43pm -0400,
NeilBrown <neilb@suse.com> wrote:

> On Wed, 12 Aug 2015 12:25:42 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 12 Aug 2015, NeilBrown wrote:
> > 
> > > On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> > > <mpatocka@redhat.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > On Mon, 10 Aug 2015, NeilBrown wrote:
> > > > 
> > > > > 
> > > > > Hi Mikulas,
> > > > >  I have a customer hitting the deadlock you described over a year ago
> > > > >  in:
> > > > > 
> > > > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > > > >          blocks
> > > > 
> > > > Ask block layer maintainers to accept that patch.
> > > 
> > > Unfortunately I don't really like the patch ... or the bioset rescue
> > > workqueues that it is based on.   Sorry.
> > > 
> > > So I might keep looking for a more localised approach....
> > 
> > The problem here is that other dm targets may deadlock in a similar way 
> > too - for example, dm-thin could deadlock on pmd->pool_lock.
> > 
> > The cause of the bug is bio queuing on current->bio_list. There is an 
> > assumption that if a dm target submits a bio to a lower-level target, the 
> > bio finishes in finite time. Queuing on current->bio_list breaks the 
> > assumption, bios can be held indefinitelly on current->bio_list.
> > 
> > The patch that flushes current->bio_list is the correct way to fix it - it 
> > makes sure that a bio can't be held indefinitely.
> > 
> > Another way to fix it would be to abandon current->bio_list --- but then, 
> > there would be problem with stack overflow on deeply nested targets.
> > 
> 
> I think it is a bit simplistic to say that current->bio_list is the
> "cause" of the problem.  It is certainly a part of the problem.  The
> assumption you mention is another part - and the two conflict.
> 
> As you say, we could abandon current->bio_list, but then we risk stack
> overflows again.
> Or we could hand the bio_list to a work-queue whenever the make_request
> function needs to schedule.... but then if handling one of those bios
> needs to schedule...  not sure, it might work.
> 
> Or we could change the assumption and never block in a make_request
> function after calling generic_make_request().  Maybe that is difficult.
> 
> Or we could change __split_and_process_bio to use bio_split() to split
> the bio, then handle the first and call generic_make_request on the
> second.  That would effectively put the second half on the end of
> bio_list so it wouldn't be tried until all requests derived from the
> first half have been handled.
> 
> None of these is completely straight forward, but I suspect all are
> possible.
> 
> I'm leaning towards the last one:  when you want to split a bio, use
> bio_split and handle the two halves separately.

Sounds worth pursuing.

> Do you have thoughts on that?

Once the late bio splitting work is merged for 4.3 I (along with Joe,
yourself, anyone interested) hope to review/audit/fix DM core to make
proper use of bio_split() et al.

So any work in this area needs to be based on the late-bio-splitting
patchset/branch that mlin has been pushing.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-13  1:23       ` NeilBrown
@ 2015-08-17 10:55         ` Mikulas Patocka
  0 siblings, 0 replies; 11+ messages in thread
From: Mikulas Patocka @ 2015-08-17 10:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: dm-devel



On Thu, 13 Aug 2015, NeilBrown wrote:

> On Wed, 12 Aug 2015 10:16:17 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 12 Aug 2015, NeilBrown wrote:
> > 
> > > Ahh - thanks for the explanation.  That makes sense.
> > > Presumably you need to wait for old reads that you don't have a read
> > > returning old data after a write has confirmed the new data is written?
> > > Is there some other reason?
> > > 
> > > As soon as the new "proper" exception is installed, no new reads will
> > > use the old address, so it should be safe to wait without holding the
> > > lock.
> > > 
> > > So what about this?
> > 
> > This is wrong too - when you add the exception to the complete hash table 
> > and drop the lock, writes to the chunk in the origin target are can modify 
> > the origin volume. These writes race with pending reads to the snapshot 
> > (__chunk_is_tracked tests for those reads). If the I/O scheduler schedules 
> > the write to the origin before the read from the snapshot, the read from 
> > the snapshot returns corrupted data.
> > 
> > There can be more problems with snapshot merging - if the chunk is on the 
> > complete hash table, the merging code assumes that it can be safely 
> > merged.
> > 
> 
> Thanks again.
> 
> I came up with the following which should address the origin-write
> issue, but I'm not so sure about snapshot merging, and it is becoming a
> lot less simple that I had hoped.
> 
> I'll see if I can come up with code for a more general solution that is
> still localised to dm - along the lines of my bio_split suggestion.
> 
> Thanks,
> NeilBrown
> 
> 
> 
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 7c82d3ccce87..7f9e1b0429c8 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1461,14 +1461,22 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
>  		goto out;
>  	}
>  
> -	/* Check for conflicting reads */
> -	__check_for_conflicting_io(s, pe->e.old_chunk);
> +	/* Add proper exception.  Now all reads will be
> +	 * redirected, so no new reads can be started on
> +	 * 'old_chunk'.
> +	 */
> +	dm_insert_exception(&s->complete, e);
> +
> +	/* Drain conflicting reads */
> +	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
> +		up_write(&s->lock);
> +		__check_for_conflicting_io(s, pe->e.old_chunk);
> +		down_write(&s->lock);
> +	}
>  
>  	/*
> -	 * Add a proper exception, and remove the
> -	 * in-flight exception from the list.
> +	 * And now we can remove the temporary exception.
>  	 */
> -	dm_insert_exception(&s->complete, e);
>  
>  out:
>  	dm_remove_exception(&pe->e);
> @@ -2089,17 +2097,17 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
>  		 */
>  		chunk = sector_to_chunk(snap->store, sector);
>  
> -		/*
> -		 * Check exception table to see if block
> -		 * is already remapped in this snapshot
> -		 * and trigger an exception if not.
> -		 */
> -		e = dm_lookup_exception(&snap->complete, chunk);
> -		if (e)
> -			goto next_snapshot;
> -
>  		pe = __lookup_pending_exception(snap, chunk);
>  		if (!pe) {
> +			/*
> +			 * Check exception table to see if block
> +			 * is already remapped in this snapshot
> +			 * and trigger an exception if not.
> +			 */
> +			e = dm_lookup_exception(&snap->complete, chunk);
> +			if (e)
> +				goto next_snapshot;
> +
>  			up_write(&snap->lock);
>  			pe = alloc_pending_exception(snap);
>  			down_write(&snap->lock);

You forgot about another race after these calls to up_write/down_write - 
after the lock is taken again, you must call both dm_lookup_exception and 
__lookup_pending_exception.

Mikulas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-13  0:43       ` NeilBrown
  2015-08-13 14:02         ` Mike Snitzer
@ 2015-08-17 13:48         ` Mikulas Patocka
  2015-08-18  6:29           ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2015-08-17 13:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: dm-devel



On Thu, 13 Aug 2015, NeilBrown wrote:

> On Wed, 12 Aug 2015 12:25:42 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 12 Aug 2015, NeilBrown wrote:
> > 
> > > On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> > > <mpatocka@redhat.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > On Mon, 10 Aug 2015, NeilBrown wrote:
> > > > 
> > > > > 
> > > > > Hi Mikulas,
> > > > >  I have a customer hitting the deadlock you described over a year ago
> > > > >  in:
> > > > > 
> > > > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > > > >          blocks
> > > > 
> > > > Ask block layer maintainers to accept that patch.
> > > 
> > > Unfortunately I don't really like the patch ... or the bioset rescue
> > > workqueues that it is based on.   Sorry.
> > > 
> > > So I might keep looking for a more localised approach....
> > 
> > The problem here is that other dm targets may deadlock in a similar way 
> > too - for example, dm-thin could deadlock on pmd->pool_lock.
> > 
> > The cause of the bug is bio queuing on current->bio_list. There is an 
> > assumption that if a dm target submits a bio to a lower-level target, the 
> > bio finishes in finite time. Queuing on current->bio_list breaks the 
> > assumption, bios can be held indefinitelly on current->bio_list.
> > 
> > The patch that flushes current->bio_list is the correct way to fix it - it 
> > makes sure that a bio can't be held indefinitely.
> > 
> > Another way to fix it would be to abandon current->bio_list --- but then, 
> > there would be problem with stack overflow on deeply nested targets.
> > 
> 
> I think it is a bit simplistic to say that current->bio_list is the
> "cause" of the problem.  It is certainly a part of the problem.  The
> assumption you mention is another part - and the two conflict.
> 
> As you say, we could abandon current->bio_list, but then we risk stack
> overflows again.
> Or we could hand the bio_list to a work-queue whenever the make_request
> function needs to schedule.... but then if handling one of those bios
> needs to schedule...  not sure, it might work.
> 
> Or we could change the assumption and never block in a make_request
> function after calling generic_make_request().  Maybe that is difficult.
> 
> Or we could change __split_and_process_bio to use bio_split() to split
> the bio, then handle the first and call generic_make_request on the
> second.  That would effectively put the second half on the end of
> bio_list so it wouldn't be tried until all requests derived from the
> first half have been handled.

I don't think it will fix the bug - even if you put the second half of the 
bio at the end of bio_list, it will still wait until other entries on the 
bio list are processed.

For example - device 1 gets a bio, splits it to bio1 and bio2, forwards 
them to device 2 and put them on current->bio_list

Device 2 receives bio1 (popped from curretn->bio_list), splits it to bio3 
and bio4, forwards them to device 3 and puts them at the end of 
current->bio_list

Device 2 receives bio2 (popped from current->bio_list), waits until bio1 
finishes, but bio1 won't ever finish because it depends on bio3 and bio4 
that are on current->bio_list.

> None of these is completely straight forward, but I suspect all are
> possible.
> 
> I'm leaning towards the last one:  when you want to split a bio, use
> bio_split and handle the two halves separately.
> 
> Do you have thoughts on that?
> 
> Thanks,
> NeilBrown

Mikulas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: dm-snap deadlock in pending_complete()
  2015-08-17 13:48         ` Mikulas Patocka
@ 2015-08-18  6:29           ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2015-08-18  6:29 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel

On Mon, 17 Aug 2015 09:48:57 -0400 (EDT) Mikulas Patocka
<mpatocka@redhat.com> wrote:

> 
> 
> On Thu, 13 Aug 2015, NeilBrown wrote:

> > Or we could change __split_and_process_bio to use bio_split() to split
> > the bio, then handle the first and call generic_make_request on the
> > second.  That would effectively put the second half on the end of
> > bio_list so it wouldn't be tried until all requests derived from the
> > first half have been handled.
> 
> I don't think it will fix the bug - even if you put the second half of the 
> bio at the end of bio_list, it will still wait until other entries on the 
> bio list are processed.
> 
> For example - device 1 gets a bio, splits it to bio1 and bio2, forwards 
> them to device 2 and put them on current->bio_list
> 
> Device 2 receives bio1 (popped from curretn->bio_list), splits it to bio3 
> and bio4, forwards them to device 3 and puts them at the end of 
> current->bio_list
> 
> Device 2 receives bio2 (popped from current->bio_list), waits until bio1 
> finishes, but bio1 won't ever finish because it depends on bio3 and bio4 
> that are on current->bio_list.
>

Yes, you are right.  I think I was thinking of a slightly different sort
of problem.

That is more insidious than I imagined, but maybe it leads to a fairly
straight forward solution.

Suppose we treated current->bio_list like a stack instead of a queue.

In your example, bio would be split into bio1 and bio2 that are both
pushed onto the stack - lets push them in reverse order to maintain a
similar set of dependencies.  So push bio2 then bio1.

Then bio1 is popped off, split into bio3 and bio4, and pushed back.

bio3 is popped and handled - nothing new pushed.
bio4 is popped and handled - nothing new pushed.
bio2 is popped, waits for bio1 - which will complete as nothing stops it
- and then proceed (probably gets split into bio4 and bio5).

The key idea here is that a bio for a lower-level device (lower in the
stacking order, where filesystem is at the top and hardware at the
bottom) is never placed after (beneath) a bio for an upper level device
in the stack.  So we always handle the lower-level bios first.

This makes sense as upper level devices may want to wait for lower
level devices, but never the reverse.

We probably don't want a strict stack discipline as if one driver
submits two bios, they should still be processed in that order.
Rather: each call to a make_request_fn gets a new queue to attach
bios to, and when it completes, this queue is pushed onto the stack of
other pending bios.

Something like this.


diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c593fb..b4663eed7615 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1961,12 +1961,15 @@ void generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+		struct bio_list remainder;
 
+		remainder = bio_list_on_stack;
+		bio_list_init(&bio_list_on_stack);
+		current->bio_list = &bio_list_on_stack;
 		q->make_request_fn(q, bio);
+		bio_list_merge(&bio_list_on_stack, &remainder);
 
 		bio = bio_list_pop(current->bio_list);
 	} while (bio);


So before handling a bio we put all other delayed bios (which must be
for devices on the same or a higher level) in 'remainder' and initialize
bio_list_on_stack which becomes current->bio_list.
bios submitted by that make_request_fn call (all for lower-level
devices) are collected in bio_list_on_stack which is then pushed onto
the remainder (actually remainder is spliced underneath
bio_list_on_stack).

Then the top bio is handled.  If  the most recent make_request_fn
submitted any bios, this will be the first of those, otherwise it will
whatever is next from some previous call.


This isn't sufficient by itself.  It still needs dm_make_request to
split a bio of the front of the original bio, map that, then submit the
mapped bio and the remains of the split bio.

I imagine something vaguely like this:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab37ae114e94..977678ff8d82 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1711,8 +1711,11 @@ static void __split_and_process_bio(struct mapped_device *md,
 	} else {
 		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-		while (ci.sector_count && !error)
-			error = __split_and_process_non_flush(&ci);
+		error = __split_and_process_non_flush(&ci);
+		if (!error && ci.sector_count) {
+			bio->bi_iter.bi_size = to_bytes(ci.sector_count);
+			generic_make_request(bio);
+		}
 	}
 
 	/* drop the extra reference count */


Though that is wrong at least because it doesn't create a proper linkage
between the old 'bio' and the clone created by clone_bio().
This would result in cloned part being fully processed before the
remainder is looked at at all.

NeilBrown

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-08-18  6:29 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-10  3:55 dm-snap deadlock in pending_complete() NeilBrown
2015-08-11  9:14 ` Mikulas Patocka
2015-08-12  1:46   ` NeilBrown
2015-08-12 14:16     ` Mikulas Patocka
2015-08-13  1:23       ` NeilBrown
2015-08-17 10:55         ` Mikulas Patocka
2015-08-12 16:25     ` Mikulas Patocka
2015-08-13  0:43       ` NeilBrown
2015-08-13 14:02         ` Mike Snitzer
2015-08-17 13:48         ` Mikulas Patocka
2015-08-18  6:29           ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.