* [PATCH 0/3] offload bios to a thread
@ 2016-06-29  0:16 Mikulas Patocka
  2016-06-30 19:40   ` Mike Snitzer
  0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2016-06-29  0:16 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel

Hi

Here I'm sending three patches to fix the deadlocks in snapshot and 
snapshot-merge.

The first patch fixes the deadlock, the following 2 patches introduce a 
timer, so that bios are not offloaded immediately, they are offloaded 
after a specified timeout, because immediate offloading can change order 
of bios and it could theoretically produce regressions. I don't know if 
these regressions really exist or not.

If there is some way to push the patches upstream, try it.

Mikulas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] offload bios to a thread
  2016-06-29  0:16 Mikulas Patocka
@ 2016-06-30 19:40   ` Mike Snitzer
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Snitzer @ 2016-06-30 19:40 UTC (permalink / raw)
  To: Mikulas Patocka, Lars Ellenberg, axboe
  Cc: linux-block, dm-devel, Roland Kammerer, Alasdair G. Kergon,
	Zdenek Kabelac

[cc'ing linux-block and drbd folks]

On Tue, Jun 28 2016 at  8:16pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> Here I'm sending three patches to fix the deadlocks in snapshot and 
> snapshot-merge.
> 
> The first patch fixes the deadlock, the following 2 patches introduce a 
> timer, so that bios are not offloaded immediately, they are offloaded 
> after a specified timeout, because immediate offloading can change order 
> of bios and it could theoretically produce regressions. I don't know if 
> these regressions really exist or not.
> 
> If there is some way to push the patches upstream, try it.

Some fix must happen before the more recent upstream kernels can be
reliably used in stacked bio-based workloads (in production).  We simply
cannot ignore this issue any more.

drbd is also hitting the same generic_make_request (current->bio_list)
problem, see:
https://www.redhat.com/archives/dm-devel/2016-June/msg00326.html

Mikulas, I've taken your 3 proposed patches and refactored them
some to split out intermediate patches that hopefully make review
easier.  Nothing other than variable names and some other style stuff
was changed -- headers were tweaked some to help with clarity.

Please see the 5 topmost "block: ..." patches here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

It should be noted that Jens had a quick look at this set and wanted to
throw up a little when he saw the (ab)use of a timer to defer punting to
the workqueue.  I explained that without the timer, always punting to
the workqueue, we could hurt performance by reordering IO or crippling
onstack plugging.  He said he'd try to think of a cleaner way forward.

Lars, please feel free to see if this set addresses the similar deadlock
you saw/fixed with drbd.  We need to converge on an acceptable fix for
this problem -- preferably sooner rather than later!

Conversely, Mikulas: if you can easily reproduce the dm-snapshot
deadlock please try Lars' fix to see if it is workable for our DM needs.

Thanks,
Mike

p.s. I'm on holiday until next Wednesday (7/6).. so may be slow to
     respond until then.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] offload bios to a thread
  2016-06-30 19:40   ` Mike Snitzer
@ 2016-06-30 23:15     ` Mike Snitzer
  -1 siblings, 0 replies; 19+ messages in thread
From: Mike Snitzer @ 2016-06-30 23:15 UTC (permalink / raw)
  To: Mikulas Patocka, Lars Ellenberg, axboe
  Cc: linux-block, dm-devel, Zdenek Kabelac, Alasdair G. Kergon,
	Roland Kammerer

On Thu, Jun 30 2016 at  3:40pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> [cc'ing linux-block and drbd folks]
> 
> On Tue, Jun 28 2016 at  8:16pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > Here I'm sending three patches to fix the deadlocks in snapshot and 
> > snapshot-merge.
> > 
> > The first patch fixes the deadlock, the following 2 patches introduce a 
> > timer, so that bios are not offloaded immediately, they are offloaded 
> > after a specified timeout, because immediate offloading can change order 
> > of bios and it could theoretically produce regressions. I don't know if 
> > these regressions really exist or not.
> > 
> > If there is some way to push the patches upstream, try it.
> 
> Some fix must happen before the more recent upstream kernels can be
> reliably used in stacked bio-based workloads (in production).  We simply
> cannot ignore this issue any more.
> 
> drbd is also hitting the same generic_make_request (current->bio_list)
> problem, see:
> https://www.redhat.com/archives/dm-devel/2016-June/msg00326.html
> 
> Mikulas, I've taken your 3 proposed patches and refactored them
> some to split out intermediate patches that hopefully make review
> easier.  Nothing other than variable names and some other style stuff
> was changed -- headers were tweaked some to help with clarity.
> 
> Please see the 5 topmost "block: ..." patches here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> 
> It should be noted that Jens had a quick look at this set and wanted to
> throw up a little when he saw the (ab)use of a timer to defer punting to
> the workqueue.  I explained that without the timer, always punting to
> the workqueue, we could hurt performance by reordering IO or crippling
> onstack plugging.  He said he'd try to think of a cleaner way forward.
> 
> Lars, please feel free to see if this set addresses the similar deadlock
> you saw/fixed with drbd.  We need to converge on an acceptable fix for
> this problem -- preferably sooner rather than later!
> 
> Conversely, Mikulas: if you can easily reproduce the dm-snapshot
> deadlock please try Lars' fix to see if it is workable for our DM needs.

I hadn't reviewed Lars' patch yet but Mikulas pointed out to me that
Lars' patch is focused on the blk_queue_split() path -- and given that
DM doesn't use this function (nor do DM devices even have a 'bio_split'
bioset, see commit dbba42d8a9e) it won't fix the DM (snapshot) deadlock.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] offload bios to a thread
  2016-06-30 23:15     ` Mike Snitzer
@ 2016-07-04  8:09       ` Lars Ellenberg
  -1 siblings, 0 replies; 19+ messages in thread
From: Lars Ellenberg @ 2016-07-04  8:09 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, linux-block, dm-devel, Mikulas Patocka, Zdenek Kabelac,
	Alasdair G. Kergon, Roland Kammerer

On Thu, Jun 30, 2016 at 07:15:18PM -0400, Mike Snitzer wrote:
> > Lars, please feel free to see if this set addresses the similar deadlock
> > you saw/fixed with drbd.

I'm pretty sure it will help, but will confirm.

> > We need to converge on an acceptable fix for
> > this problem -- preferably sooner rather than later!
> > 
> > Conversely, Mikulas: if you can easily reproduce the dm-snapshot
> > deadlock please try Lars' fix to see if it is workable for our DM needs.
> 
> I hadn't reviewed Lars' patch yet but Mikulas pointed out to me that
> Lars' patch is focused on the blk_queue_split() path -- and given that
> DM doesn't use this function (nor do DM devices even have a 'bio_split'
> bioset, see commit dbba42d8a9e) it won't fix the DM (snapshot) deadlock.

Don't you get it implicitly when using dm-mq -> blk-mq?

    Lars
	

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] offload bios to a thread
  2016-07-04  8:09       ` Lars Ellenberg
@ 2016-07-04 22:27         ` Mikulas Patocka
  -1 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:27 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: axboe, Mike Snitzer, linux-block, dm-devel, Zdenek Kabelac,
	Alasdair G. Kergon, Roland Kammerer



On Mon, 4 Jul 2016, Lars Ellenberg wrote:

> On Thu, Jun 30, 2016 at 07:15:18PM -0400, Mike Snitzer wrote:
> > > Lars, please feel free to see if this set addresses the similar deadlock
> > > you saw/fixed with drbd.
> 
> I'm pretty sure it will help, but will confirm.
> 
> > > We need to converge on an acceptable fix for
> > > this problem -- preferably sooner rather than later!
> > > 
> > > Conversely, Mikulas: if you can easily reproduce the dm-snapshot
> > > deadlock please try Lars' fix to see if it is workable for our DM needs.
> > 
> > I hadn't reviewed Lars' patch yet but Mikulas pointed out to me that
> > Lars' patch is focused on the blk_queue_split() path -- and given that
> > DM doesn't use this function (nor do DM devices even have a 'bio_split'
> > bioset, see commit dbba42d8a9e) it won't fix the DM (snapshot) deadlock.
> 
> Don't you get it implicitly when using dm-mq -> blk-mq?
> 
>     Lars

There were observed deadlocks just between dm targets, caused by queuing 
bios on current->bio_list.

The underlying block device was not involved in the deadlocks. Therefore I 
conclude that changing the behavior of blk_queue_split would not resolve 
these deadlocks (because dm targets do not use blk_queue_split).

Mikulas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] offload bios to a thread
  2016-06-30 19:40   ` Mike Snitzer
@ 2016-07-04 22:45     ` Mikulas Patocka
  -1 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, linux-block, dm-devel, Zdenek Kabelac, Lars Ellenberg,
	Alasdair G. Kergon, Roland Kammerer



On Thu, 30 Jun 2016, Mike Snitzer wrote:

> [cc'ing linux-block and drbd folks]
> 
> On Tue, Jun 28 2016 at  8:16pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > Here I'm sending three patches to fix the deadlocks in snapshot and 
> > snapshot-merge.
> > 
> > The first patch fixes the deadlock, the following 2 patches introduce a 
> > timer, so that bios are not offloaded immediately, they are offloaded 
> > after a specified timeout, because immediate offloading can change order 
> > of bios and it could theoretically produce regressions. I don't know if 
> > these regressions really exist or not.
> > 
> > If there is some way to push the patches upstream, try it.
> 
> Some fix must happen before the more recent upstream kernels can be
> reliably used in stacked bio-based workloads (in production).  We simply
> cannot ignore this issue any more.
> 
> drbd is also hitting the same generic_make_request (current->bio_list)
> problem, see:
> https://www.redhat.com/archives/dm-devel/2016-June/msg00326.html
> 
> Mikulas, I've taken your 3 proposed patches and refactored them
> some to split out intermediate patches that hopefully make review
> easier.  Nothing other than variable names and some other style stuff
> was changed -- headers were tweaked some to help with clarity.
> 
> Please see the 5 topmost "block: ..." patches here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

I found a problem with the patches when using loop device - we must not 
offload bios to the rescue thread if they are allocated from fs_bio_set. 
I'll send a second version of the patches with this change. You can 
incorporate that change to your git tree.

> It should be noted that Jens had a quick look at this set and wanted to
> throw up a little when he saw the (ab)use of a timer to defer punting to
> the workqueue.  I explained that without the timer, always punting to
> the workqueue, we could hurt performance by reordering IO or crippling
> onstack plugging.  He said he'd try to think of a cleaner way forward.

The behavior depends on the timer only in a situation when the deadlock 
actually happens - the timer doesn't hurt performance on normal use. So, 
it's better to have timed delay in bio processing than a deadlock :)

The timer part can be dropped entirely if someone shows that offloading 
bios on schedule doesn't hurt performance in any way. Does anyone have a 
large collection of block layer performance tests that could be tried to 
detect if the regression happens?

Mikulas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/3] offload bios to a thread
@ 2016-07-04 22:53 Mikulas Patocka
  2016-07-04 22:56 ` [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock Mikulas Patocka
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:53 UTC (permalink / raw)
  To: Alasdair G. Kergon, snitm, Zdenek Kabelac; +Cc: dm-devel

Hi

This is the second version of patches that fix deadlocks by redirecting 
bios from current->bio_list to rescuer workqueues.

I found out that the original patches caused deadlock with the loopback 
device. When the loopback device is used, both lower and upper filesystems 
use the same bio set - fs_bio_set. Consequently, bios submitted by both of 
them end up on the same rescuer workqueue. There is a deadlock possibility 
- if generic_make_request for the upper filesystem's bio blocks (because 
there are too many requests in flight on the loop device), it may stall 
processing some bios for the lower filesystem.

Ideally, each filesystem should have its own bio set. But it doesn't. So 
I fix this problem by not offloading bios allocated from fs_bio_set.
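
The exemption can be pictured with a small user-space sketch. This is
not kernel code: sketch_pool, sketch_bio and sketch_flush are
hypothetical names standing in for a bio_set, a bio and the flush
routine, and an array stands in for current->bio_list. Bios with no
pool, or from the shared filesystem pool, stay on the list; everything
else is handed to its own pool's rescuer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio_set: only counts bios handed to its rescuer. */
struct sketch_pool {
	int rescued;
};

/* Hypothetical stand-in for a bio: only remembers which pool it came from. */
struct sketch_bio {
	struct sketch_pool *pool;
};

/*
 * Walk the queued bios. A bio with no pool, or one allocated from the
 * shared filesystem pool, stays on the list (compacted to the front, order
 * preserved). Every other bio is counted as offloaded to its pool's rescuer.
 * Returns the number of bios left on the list.
 */
static size_t sketch_flush(struct sketch_bio *list, size_t n,
			   const struct sketch_pool *fs_pool)
{
	size_t kept = 0;

	assert(list != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct sketch_pool *p = list[i].pool;

		if (p == NULL || p == fs_pool) {
			list[kept++] = list[i];
			continue;
		}
		p->rescued++;
	}
	return kept;
}
```

With a list holding two bios from a stacking driver's pool, one from
the filesystem pool and one with no pool, sketch_flush() leaves the
latter two on the list and counts two bios against the driver's pool.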

Mikulas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock
  2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
@ 2016-07-04 22:56 ` Mikulas Patocka
  2016-07-04 22:58 ` [PATCH 2/3] block: prepare for timed bio offload Mikulas Patocka
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:56 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel

The block layer uses a per-process bio list to avoid recursion in
generic_make_request.  When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately.  The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.
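
As a rough user-space illustration of that mechanism (a toy model, not
the kernel implementation), the sketch below uses hypothetical names
toy_bio, toy_task, toy_process and toy_submit. A nested submit only
queues the bio and returns; the outermost submit drains the queue, so
the processing depth never exceeds one.

```c
#include <assert.h>

#define TOY_MAX 16

/* Hypothetical bio: an id plus the number of child bios a stacking driver would submit. */
struct toy_bio {
	int id;
	int children;
};

/* Hypothetical per-task state, standing in for current->bio_list. */
struct toy_task {
	struct toy_bio pending[TOY_MAX];
	int head, tail;
	int active;		/* nonzero while the top-level submit is running */
	int deferred;		/* how many bios were queued instead of processed */
	int order[TOY_MAX];	/* ids in the order they were processed */
	int n_done;
	int depth, max_depth;	/* nesting depth of toy_process() */
};

static void toy_submit(struct toy_task *t, struct toy_bio bio);

/* "Driver" step: record the bio, then submit its children. */
static void toy_process(struct toy_task *t, struct toy_bio bio)
{
	assert(t->n_done < TOY_MAX);
	t->order[t->n_done++] = bio.id;

	if (++t->depth > t->max_depth)
		t->max_depth = t->depth;

	for (int i = 1; i <= bio.children; i++) {
		struct toy_bio child = { bio.id * 10 + i, 0 };
		toy_submit(t, child);
	}
	t->depth--;
}

static void toy_submit(struct toy_task *t, struct toy_bio bio)
{
	if (t->active) {
		/* Called from inside toy_process(): queue and return at once. */
		assert(t->tail < TOY_MAX);
		t->pending[t->tail++] = bio;
		t->deferred++;
		return;
	}

	/* Top-level call: process this bio, then drain whatever got queued. */
	t->active = 1;
	toy_process(t, bio);
	while (t->head < t->tail)
		toy_process(t, t->pending[t->head++]);
	t->head = t->tail = 0;
	t->active = 0;
}
```

Submitting a bio with id 1 and two children processes ids 1, 11, 12 in
that order, with both children deferred. The point relevant to this
patch is that a deferred bio makes no progress until the code that
queued it returns to the top-level loop.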

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However another deadlock (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call.  Consequently, when
the process blocks on a mutex, the bios queued on current->bio_list are
dispatched to independent workqueues and they can complete without
waiting for the mutex to be available.

Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
calls to it because bio_alloc_bioset() will implicitly punt all bios on
current->bio_list if it performs a blocking allocation.

** Here is the dm-snapshot deadlock that was observed:

1) Process A sends a one-page read bio to the dm-snapshot target. The bio
spans a snapshot chunk boundary, so device mapper splits it into two
bios.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (that allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
the bio to the underlying device and exits with DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
issued - it is added to current->bio_list instead.

5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
chunk affected by the first remapped bio, it takes down_write(&s->lock)
and then loops in __check_for_conflicting_io, waiting for
dm_snap_tracked_chunk created in step 3) to be released.

6) Process A continues, it creates a second sub-bio for the rest of the
original bio.

7) snapshot_map is called for this new bio, it waits on
down_write(&s->lock) that is held by Process B (in step 5).
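
The steps above form a wait-for cycle, which the following user-space
sketch makes explicit. It is only a model under simplifying
assumptions: each party waits on at most one other, and the names
(PROC_A, PROC_B, QUEUED_BIO, build_waits, on_wait_cycle) are
hypothetical. The offload_on_block flag models the proposed fix, where
the queued bio is handed to a rescuer and no longer depends on process
A unwinding.

```c
#include <assert.h>

enum { PROC_A, PROC_B, QUEUED_BIO, N_NODES };
#define NO_WAIT (-1)

/*
 * Fill in who waits for whom.
 *   A          -> B          : step 7, A blocks on s->lock held by B
 *   B          -> QUEUED_BIO : step 5, B waits for the tracked chunk, which
 *                              is released only when the first sub-bio completes
 *   QUEUED_BIO -> A          : step 4, the bio sits on A's bio list and is
 *                              issued only after A returns
 * With offload_on_block set, the bio goes to a rescuer and waits on nobody.
 */
static void build_waits(int waits_for[N_NODES], int offload_on_block)
{
	waits_for[PROC_A] = PROC_B;
	waits_for[PROC_B] = QUEUED_BIO;
	waits_for[QUEUED_BIO] = offload_on_block ? NO_WAIT : PROC_A;
}

/* Follow the single outgoing edge from start; return 1 if we come back to it. */
static int on_wait_cycle(const int waits_for[N_NODES], int start)
{
	int cur = start;

	assert(start >= 0 && start < N_NODES);

	for (int steps = 0; steps < N_NODES; steps++) {
		cur = waits_for[cur];
		if (cur == NO_WAIT)
			return 0;
		if (cur == start)
			return 1;
	}
	return 0;
}
```

Without offloading, on_wait_cycle() reports a cycle starting from any
of the three nodes; with offloading, the chain ends at the queued bio
and nothing is stuck.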

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Cc: stable@vger.kernel.org

---
 block/bio.c            |   77 +++++++++++++++++++------------------------------
 include/linux/blkdev.h |   24 ++++++++++-----
 kernel/sched/core.c    |    7 +---
 3 files changed, 50 insertions(+), 58 deletions(-)

Index: linux-4.7-rc6/block/bio.c
===================================================================
--- linux-4.7-rc6.orig/block/bio.c	2016-07-04 23:00:17.000000000 +0200
+++ linux-4.7-rc6/block/bio.c	2016-07-05 00:02:01.000000000 +0200
@@ -349,35 +349,37 @@ static void bio_alloc_rescue(struct work
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * Pop bios queued on @tsk->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * If the bio is allocated from fs_bio_set, we must leave it to avoid
+ * deadlock on loopback block device.
+ * Stacking bio drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *tsk->bio_list;
+	bio_list_init(tsk->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs) || bs == fs_bio_set) {
+			bio_list_add(tsk->bio_list, bio);
+			continue;
+		}
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_unlock(&bs->rescue_lock);
+	}
 }
 
 /**
@@ -417,7 +419,6 @@ static void punt_bios_to_rescuer(struct 
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -452,23 +453,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		 * reserve.
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
+		 * workqueue per bio_set. If an allocation would block (due to
+		 * __GFP_DIRECT_RECLAIM) the scheduler will first punt all bios
+		 * on current->bio_list to the rescuer workqueue.
 		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -481,12 +470,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
Index: linux-4.7-rc6/include/linux/blkdev.h
===================================================================
--- linux-4.7-rc6.orig/include/linux/blkdev.h	2016-07-04 23:00:17.000000000 +0200
+++ linux-4.7-rc6/include/linux/blkdev.h	2016-07-04 23:58:02.000000000 +0200
@@ -1114,6 +1114,22 @@ static inline bool blk_needs_flush_plug(
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(struct task_struct *tsk);
+
+static inline void blk_flush_queued_io(struct task_struct *tsk)
+{
+	/*
+	 * Flush any queued bios to corresponding rescue threads.
+	 */
+	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+		blk_flush_bio_list(tsk);
+	/*
+	 * Flush any plugged IO that is queued.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
+}
+
 /*
  * tag stuff
  */
@@ -1722,16 +1738,10 @@ static inline void blk_flush_plug(struct
 {
 }
 
-static inline void blk_schedule_flush_plug(struct task_struct *task)
+static inline void blk_flush_queued_io(struct task_struct *tsk)
 {
 }
 
-
-static inline bool blk_needs_flush_plug(struct task_struct *tsk)
-{
-	return false;
-}
-
 static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 				     sector_t *error_sector)
 {
Index: linux-4.7-rc6/kernel/sched/core.c
===================================================================
--- linux-4.7-rc6.orig/kernel/sched/core.c	2016-07-04 23:00:17.000000000 +0200
+++ linux-4.7-rc6/kernel/sched/core.c	2016-07-04 23:01:29.000000000 +0200
@@ -3359,11 +3359,10 @@ static inline void sched_submit_work(str
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
-	 * If we are going to sleep and we have plugged IO queued,
+	 * If we are going to sleep and we have queued IO,
 	 * make sure to submit it to avoid deadlocks.
 	 */
-	if (blk_needs_flush_plug(tsk))
-		blk_schedule_flush_plug(tsk);
+	blk_flush_queued_io(tsk);
 }
 
 asmlinkage __visible void __sched schedule(void)
@@ -4977,7 +4976,7 @@ long __sched io_schedule_timeout(long ti
 	long ret;
 
 	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
+	blk_flush_queued_io(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();


* [PATCH 2/3] block: prepare for timed bio offload
  2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
  2016-07-04 22:56 ` [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock Mikulas Patocka
@ 2016-07-04 22:58 ` Mikulas Patocka
  2016-07-04 22:59 ` [PATCH 3/3] block: use timed offload Mikulas Patocka
  2016-07-06 13:36 ` [PATCH 0/3] offload bios to a thread Mike Snitzer
  3 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:58 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel

Replace the pointer current->bio_list with a pointer to a new structure,
queued_bios, which embeds the bio_list. This is a prerequisite for the
following patch, which uses the timer placed in this structure.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
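Illustrative note, not part of the patch: the conversion described in the
commit message can be modelled in userspace as below. This is only a
sketch under stated assumptions. struct timer_stub, struct task,
in_request(), has_queued_bios() and queued_bios_demo() are hypothetical
stand-ins invented for the example; only struct queued_bios and the two
converted expressions follow the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct bio { struct bio *bi_next; };
struct bio_list { struct bio *head, *tail; };
struct timer_stub { void (*function)(unsigned long); };	/* hypothetical stand-in for struct timer_list */

/* Same shape as the wrapper added by the patch: the list plus a timer. */
struct queued_bios {
	struct bio_list bio_list;
	struct timer_stub timer;	/* unused until the follow-up patch */
};

/* Hypothetical miniature of task_struct. */
struct task { struct queued_bios *queued_bios; };

static void bio_list_init(struct bio_list *bl) { bl->head = bl->tail = NULL; }
static int bio_list_empty(const struct bio_list *bl) { return bl->head == NULL; }

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->bi_next = NULL;
	if (bl->tail)
		bl->tail->bi_next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

/* Before the patch: tsk->bio_list != NULL */
int in_request(const struct task *tsk)
{
	return tsk->queued_bios != NULL;
}

/* Before the patch: tsk->bio_list && !bio_list_empty(tsk->bio_list) */
int has_queued_bios(const struct task *tsk)
{
	return tsk->queued_bios && !bio_list_empty(&tsk->queued_bios->bio_list);
}

/* Walks through the states that generic_make_request() goes through. */
int queued_bios_demo(void)
{
	struct queued_bios on_stack;
	struct task tsk = { NULL };
	struct bio b;

	assert(!in_request(&tsk));		/* no make_request active */
	bio_list_init(&on_stack.bio_list);
	tsk.queued_bios = &on_stack;		/* activate */
	assert(in_request(&tsk) && !has_queued_bios(&tsk));
	bio_list_add(&tsk.queued_bios->bio_list, &b);
	assert(has_queued_bios(&tsk));
	tsk.queued_bios = NULL;			/* deactivate */
	return 0;
}
```

The point of the sketch is that callers which only tested the pointer keep
testing a pointer, while callers which touched the list now reach it
through &...->bio_list.
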
 block/bio.c               |    6 +++---
 block/blk-core.c          |   16 ++++++++--------
 drivers/md/bcache/btree.c |   12 ++++++------
 drivers/md/dm-bufio.c     |    2 +-
 drivers/md/raid1.c        |    6 +++---
 drivers/md/raid10.c       |    6 +++---
 include/linux/blkdev.h    |    7 ++++++-
 include/linux/sched.h     |    4 ++--
 8 files changed, 32 insertions(+), 27 deletions(-)

Index: linux-4.7-rc6/include/linux/sched.h
===================================================================
--- linux-4.7-rc6.orig/include/linux/sched.h	2016-07-04 23:58:01.000000000 +0200
+++ linux-4.7-rc6/include/linux/sched.h	2016-07-05 00:02:10.000000000 +0200
@@ -128,7 +128,7 @@ struct sched_attr {
 
 struct futex_pi_state;
 struct robust_list_head;
-struct bio_list;
+struct queued_bios;
 struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
@@ -1727,7 +1727,7 @@ struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct queued_bios *queued_bios;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */
Index: linux-4.7-rc6/block/blk-core.c
===================================================================
--- linux-4.7-rc6.orig/block/blk-core.c	2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/block/blk-core.c	2016-07-05 00:02:10.000000000 +0200
@@ -2031,7 +2031,7 @@ end_io:
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct queued_bios queued_bios_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2047,8 +2047,8 @@ blk_qc_t generic_make_request(struct bio
 	 * it is non-NULL, then a make_request is active, and new requests
 	 * should be added at the tail
 	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	if (current->queued_bios) {
+		bio_list_add(&current->queued_bios->bio_list, bio);
 		goto out;
 	}
 
@@ -2067,8 +2067,8 @@ blk_qc_t generic_make_request(struct bio
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&queued_bios_on_stack.bio_list);
+	current->queued_bios = &queued_bios_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
@@ -2077,15 +2077,15 @@ blk_qc_t generic_make_request(struct bio
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(&current->queued_bios->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
+			struct bio *bio_next = bio_list_pop(&current->queued_bios->bio_list);
 
 			bio_io_error(bio);
 			bio = bio_next;
 		}
 	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	current->queued_bios = NULL; /* deactivate */
 
 out:
 	return ret;
Index: linux-4.7-rc6/include/linux/blkdev.h
===================================================================
--- linux-4.7-rc6.orig/include/linux/blkdev.h	2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/include/linux/blkdev.h	2016-07-05 00:02:10.000000000 +0200
@@ -1114,6 +1114,11 @@ static inline bool blk_needs_flush_plug(
 		 !list_empty(&plug->cb_list));
 }
 
+struct queued_bios {
+	struct bio_list bio_list;
+	struct timer_list timer;
+};
+
 extern void blk_flush_bio_list(struct task_struct *tsk);
 
 static inline void blk_flush_queued_io(struct task_struct *tsk)
@@ -1121,7 +1126,7 @@ static inline void blk_flush_queued_io(s
 	/*
 	 * Flush any queued bios to corresponding rescue threads.
 	 */
-	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+	if (tsk->queued_bios && !bio_list_empty(&tsk->queued_bios->bio_list))
 		blk_flush_bio_list(tsk);
 	/*
 	 * Flush any plugged IO that is queued.
Index: linux-4.7-rc6/block/bio.c
===================================================================
--- linux-4.7-rc6.orig/block/bio.c	2016-07-05 00:02:01.000000000 +0200
+++ linux-4.7-rc6/block/bio.c	2016-07-05 00:02:10.000000000 +0200
@@ -365,13 +365,13 @@ static void bio_alloc_rescue(struct work
 void blk_flush_bio_list(struct task_struct *tsk)
 {
 	struct bio *bio;
-	struct bio_list list = *tsk->bio_list;
-	bio_list_init(tsk->bio_list);
+	struct bio_list list = tsk->queued_bios->bio_list;
+	bio_list_init(&tsk->queued_bios->bio_list);
 
 	while ((bio = bio_list_pop(&list))) {
 		struct bio_set *bs = bio->bi_pool;
 		if (unlikely(!bs) || bs == fs_bio_set) {
-			bio_list_add(tsk->bio_list, bio);
+			bio_list_add(&tsk->queued_bios->bio_list, bio);
 			continue;
 		}
 
Index: linux-4.7-rc6/drivers/md/bcache/btree.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/bcache/btree.c	2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/bcache/btree.c	2016-07-05 00:02:10.000000000 +0200
@@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree
 
 	trace_bcache_btree_write(b);
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->queued_bios);
 	BUG_ON(b->written >= btree_blocks(b));
 	BUG_ON(b->written && !i->keys);
 	BUG_ON(btree_bset_first(b)->seq != i->seq);
@@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct 
 
 	/* Force write if set is too big */
 	if (set_bytes(i) > PAGE_SIZE - 48 &&
-	    !current->bio_list)
+	    !current->queued_bios)
 		bch_btree_node_write(b, NULL);
 }
 
@@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct ca
 {
 	struct btree *b;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->queued_bios);
 
 	lockdep_assert_held(&c->bucket_lock);
 
@@ -976,7 +976,7 @@ retry:
 	b = mca_find(c, k);
 
 	if (!b) {
-		if (current->bio_list)
+		if (current->queued_bios)
 			return ERR_PTR(-EAGAIN);
 
 		mutex_lock(&c->bucket_lock);
@@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct 
 
 	return 0;
 split:
-	if (current->bio_list) {
+	if (current->queued_bios) {
 		op->lock = b->c->root->level + 1;
 		return -EAGAIN;
 	} else if (op->lock <= b->c->root->level) {
@@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c
 	struct btree_insert_op op;
 	int ret = 0;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->queued_bios);
 	BUG_ON(bch_keylist_empty(keys));
 
 	bch_btree_op_init(&op.op, 0);
Index: linux-4.7-rc6/drivers/md/dm-bufio.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/dm-bufio.c	2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/dm-bufio.c	2016-07-05 00:02:10.000000000 +0200
@@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(s
 #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
 #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
 
-#define dm_bufio_in_request()	(!!current->bio_list)
+#define dm_bufio_in_request()	(!!current->queued_bios)
 
 static void dm_bufio_lock(struct dm_bufio_client *c)
 {
Index: linux-4.7-rc6/drivers/md/raid1.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/raid1.c	2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/raid1.c	2016-07-05 00:02:10.000000000 +0200
@@ -876,8 +876,8 @@ static sector_t wait_barrier(struct r1co
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
+				      current->queued_bios &&
+				      !bio_list_empty(&current->queued_bios->bio_list))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1014,7 +1014,7 @@ static void raid1_unplug(struct blk_plug
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->queued_bios) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
Index: linux-4.7-rc6/drivers/md/raid10.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/raid10.c	2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/raid10.c	2016-07-05 00:02:10.000000000 +0200
@@ -945,8 +945,8 @@ static void wait_barrier(struct r10conf 
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (conf->nr_pending &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     current->queued_bios &&
+				     !bio_list_empty(&current->queued_bios->bio_list)),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1022,7 +1022,7 @@ static void raid10_unplug(struct blk_plu
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->queued_bios) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;


* [PATCH 3/3] block: use timed offload
  2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
  2016-07-04 22:56 ` [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock Mikulas Patocka
  2016-07-04 22:58 ` [PATCH 2/3] block: prepare for timed bio offload Mikulas Patocka
@ 2016-07-04 22:59 ` Mikulas Patocka
  2016-07-06 13:36 ` [PATCH 0/3] offload bios to a thread Mike Snitzer
  3 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:59 UTC (permalink / raw)
  To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel

This patch introduces a timed bio offload.

When a process is about to sleep in the scheduler while bios are queued
on current->queued_bios, we arm a timer that redirects the queued bios to
their rescuer workqueues after a fixed timeout (BIO_RESCUE_TIMEOUT,
currently HZ, i.e. 1 second).

The reason for the timer is that immediate bio offload could change the
ordering of bios, which could theoretically cause performance regressions.
So we offload bios only if the process stays blocked for at least the
timeout.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
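Illustrative note, not part of the patch: the arm/cancel/fire behaviour
described in the commit message can be modelled in userspace with a tick
counter. This is a simplified sketch under stated assumptions. struct
model_queue and the model_*() functions are hypothetical, HZ is assumed to
be 100, bios are only counted, and the fs_bio_set exception is ignored.

```c
#include <assert.h>

#define HZ			100	/* assumed tick rate for this model */
#define BIO_RESCUE_TIMEOUT	HZ	/* one second, as in the patch */

/* Hypothetical model of struct queued_bios; bios are only counted. */
struct model_queue {
	int queued;		/* bios waiting on the per-task list */
	int rescued;		/* bios handed to a rescuer workqueue */
	int timer_set;		/* models q->timer.function != NULL */
	int timer_pending;	/* timer armed and not yet fired */
	unsigned long expires;	/* tick at which the timer fires */
};

/* The task is about to sleep: arm the timer once if bios are queued. */
void model_task_blocks(struct model_queue *q, unsigned long now)
{
	if (q->queued > 0 && !q->timer_set) {
		q->timer_set = 1;
		q->timer_pending = 1;
		q->expires = now + BIO_RESCUE_TIMEOUT;
	}
}

/* The task submits another bio: stop the timer before touching the list. */
void model_submit(struct model_queue *q)
{
	q->timer_set = 0;	/* models del_timer_sync() + function = NULL */
	q->timer_pending = 0;
	q->queued++;
}

/* Clock tick: an expired timer moves all queued bios to the rescuer. */
void model_tick(struct model_queue *q, unsigned long now)
{
	if (q->timer_pending && (long)(now - q->expires) >= 0) {
		q->rescued += q->queued;
		q->queued = 0;
		q->timer_pending = 0;
	}
}

/* A short sleep offloads nothing; a sleep of a full timeout offloads all. */
int timed_offload_demo(void)
{
	struct model_queue q = { 0, 0, 0, 0, 0 };

	model_submit(&q);
	model_submit(&q);
	model_task_blocks(&q, 0);	/* timer would fire at tick 100 */
	model_tick(&q, 50);
	assert(q.rescued == 0);		/* woke up before the timeout */
	model_submit(&q);		/* cancels the timer */
	model_tick(&q, 100);
	assert(q.rescued == 0);
	model_task_blocks(&q, 200);	/* timer would fire at tick 300 */
	model_tick(&q, 300);
	assert(q.rescued == 3 && q.queued == 0);
	return q.rescued;
}
```

In this model a task that wakes up and continues submitting within the
timeout keeps its bios in their original order; only a task that stays
blocked for the whole timeout has them offloaded.
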
 block/bio.c      |   45 +++++++++++++++++++++++++++++++++------------
 block/blk-core.c |   19 +++++++++++++++++--
 2 files changed, 50 insertions(+), 14 deletions(-)

Index: linux-4.7-rc5-devel/block/bio.c
===================================================================
--- linux-4.7-rc5-devel.orig/block/bio.c	2016-06-30 17:25:56.000000000 +0200
+++ linux-4.7-rc5-devel/block/bio.c	2016-06-30 17:26:04.000000000 +0200
@@ -338,9 +338,10 @@ static void bio_alloc_rescue(struct work
 	struct bio *bio;
 
 	while (1) {
-		spin_lock(&bs->rescue_lock);
+		unsigned long flags;
+		spin_lock_irqsave(&bs->rescue_lock, flags);
 		bio = bio_list_pop(&bs->rescue_list);
-		spin_unlock(&bs->rescue_lock);
+		spin_unlock_irqrestore(&bs->rescue_lock, flags);
 
 		if (!bio)
 			break;
@@ -350,35 +351,55 @@ static void bio_alloc_rescue(struct work
 }
 
 /**
- * blk_flush_bio_list
- * @tsk: task_struct whose bio_list must be flushed
+ * blk_timer_flush_bio_list
  *
- * Pop bios queued on @tsk->bio_list and submit each of them to
+ * Pop bios queued on q->bio_list and submit each of them to
  * their rescue workqueue.
  *
- * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * If the bio doesn't have a bio_set, we leave it on q->bio_list.
  * If the bio is allocated from fs_bio_set, we must leave it to avoid
  * deadlock on loopback block device.
  * Stacking bio drivers should use bio_set, so this shouldn't be
  * an issue.
  */
-void blk_flush_bio_list(struct task_struct *tsk)
+static void blk_timer_flush_bio_list(unsigned long data)
 {
+	struct queued_bios *q = (struct queued_bios *)data;
 	struct bio *bio;
-	struct bio_list list = tsk->queued_bios->bio_list;
-	bio_list_init(&tsk->queued_bios->bio_list);
+
+	struct bio_list list = q->bio_list;
+	bio_list_init(&q->bio_list);
 
 	while ((bio = bio_list_pop(&list))) {
+		unsigned long flags;
 		struct bio_set *bs = bio->bi_pool;
 		if (unlikely(!bs) || bs == fs_bio_set) {
-			bio_list_add(&tsk->queued_bios->bio_list, bio);
+			bio_list_add(&q->bio_list, bio);
 			continue;
 		}
 
-		spin_lock(&bs->rescue_lock);
+		spin_lock_irqsave(&bs->rescue_lock, flags);
 		bio_list_add(&bs->rescue_list, bio);
 		queue_work(bs->rescue_workqueue, &bs->rescue_work);
-		spin_unlock(&bs->rescue_lock);
+		spin_unlock_irqrestore(&bs->rescue_lock, flags);
+	}
+}
+
+#define BIO_RESCUE_TIMEOUT	HZ
+
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * This function sets up a timer that flushes the queued bios.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
+{
+	struct queued_bios *q = tsk->queued_bios;
+	if (q->timer.function == NULL) {
+		setup_timer(&q->timer, blk_timer_flush_bio_list,
+			    (unsigned long)q);
+		mod_timer(&q->timer, jiffies + BIO_RESCUE_TIMEOUT);
 	}
 }
 
Index: linux-4.7-rc5-devel/block/blk-core.c
===================================================================
--- linux-4.7-rc5-devel.orig/block/blk-core.c	2016-06-30 17:25:56.000000000 +0200
+++ linux-4.7-rc5-devel/block/blk-core.c	2016-06-30 17:26:04.000000000 +0200
@@ -2032,6 +2032,7 @@ end_io:
 blk_qc_t generic_make_request(struct bio *bio)
 {
 	struct queued_bios queued_bios_on_stack;
+	struct queued_bios *q;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2047,8 +2048,17 @@ blk_qc_t generic_make_request(struct bio
 	 * it is non-NULL, then a make_request is active, and new requests
 	 * should be added at the tail
 	 */
-	if (current->queued_bios) {
-		bio_list_add(&current->queued_bios->bio_list, bio);
+	q = current->queued_bios;
+	if (q) {
+		/*
+		 * The timer may modify q->bio_list. So we must stop the timer
+		 * before modifying the list.
+		 */
+		if (q->timer.function != NULL) {
+			del_timer_sync(&q->timer);
+			q->timer.function = NULL;
+		}
+		bio_list_add(&q->bio_list, bio);
 		goto out;
 	}
 
@@ -2068,6 +2078,7 @@ blk_qc_t generic_make_request(struct bio
 	 */
 	BUG_ON(bio->bi_next);
 	bio_list_init(&queued_bios_on_stack.bio_list);
+	queued_bios_on_stack.timer.function = NULL;
 	current->queued_bios = &queued_bios_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
@@ -2084,6 +2095,10 @@ blk_qc_t generic_make_request(struct bio
 			bio_io_error(bio);
 			bio = bio_next;
 		}
+		if (unlikely(queued_bios_on_stack.timer.function != NULL)) {
+			del_timer_sync(&queued_bios_on_stack.timer);
+			queued_bios_on_stack.timer.function = NULL;
+		}
 	} while (bio);
 	current->queued_bios = NULL; /* deactivate */
 


* Re: [PATCH 0/3] offload bios to a thread
  2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
                   ` (2 preceding siblings ...)
  2016-07-04 22:59 ` [PATCH 3/3] block: use timed offload Mikulas Patocka
@ 2016-07-06 13:36 ` Mike Snitzer
  2016-07-06 13:53   ` Mikulas Patocka
  3 siblings, 1 reply; 19+ messages in thread
From: Mike Snitzer @ 2016-07-06 13:36 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac

On Mon, Jul 04 2016 at  6:53pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> This is the second version of patches that fix deadlocks by redirecting 
> bios from current->bio_list to rescuer workqueues.
> 
> I found out that the original patches caused deadlock with the loopback 
> device. When the loopback device is used, both lower and upper filesystems 
> use the same bio set - fs_bio_set. Consequently, bios submitted by both of 
> them end up on the same rescuer workqueue. There is a deadlock possibility 
> - if generic_make_request for the upper filesystem's bio blocks (because 
> there are too many requests in flight on the loop device), it may stall 
> processing some bios for the lower filesystem.
> 
> Ideally, each filesystem should have its own bio set. But it doesn't. So
> I fix this problem by not offloading bios allocated from fs_bio_set.

I'd have much preferred that you just send an incremental fix built on
the tree you know I started, here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

I've now folded your fix into this tree.

But please don't ignore work that you know was done to further prepare
your patches for inclusion.  It makes for tedious busywork on my end to
pull out the incremental fix, which is simply:

diff --git a/block/bio.c b/block/bio.c
index 7c49b91..80ebe88 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -357,7 +357,9 @@ static void bio_alloc_rescue(struct work_struct *work)
  * to their rescue workqueue.
  *
  * If the bio doesn't have a bio_set, we leave it on queued_bios->bio_list.
- * However, stacking drivers should use bio_set, so this shouldn't be
+ * If the bio is allocated from fs_bio_set, we must leave it to avoid
+ * deadlock on loopback block device.
+ * But stacking drivers should use a bio_set, so this shouldn't be
  * an issue.
  */
 static void blk_timer_flush_bio_list(unsigned long data)
@@ -371,7 +373,7 @@ static void blk_timer_flush_bio_list(unsigned long data)
 	while ((bio = bio_list_pop(&list))) {
 		unsigned long flags;
 		struct bio_set *bs = bio->bi_pool;
-		if (unlikely(!bs)) {
+		if (unlikely(!bs) || bs == fs_bio_set) {
 			bio_list_add(&queued_bios->bio_list, bio);
 			continue;
 		}
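
For illustration only (this is not part of the fix): the effect of the
extra check can be seen in a small userspace model of the flush loop. It
is a sketch under stated assumptions. The types are cut-down stand-ins,
the rescued counter replaces the real rescue_list and queue_work(), and
flush_bio_list() and flush_filter_demo() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types. */
struct bio_set { int rescued; };	/* hypothetical: counts bios given to its rescuer */
struct bio { struct bio *bi_next; struct bio_set *bi_pool; };
struct bio_list { struct bio *head, *tail; };

/* Models the global bio set shared by all filesystems. */
struct bio_set fs_bio_set_storage;
struct bio_set *fs_bio_set = &fs_bio_set_storage;

static void bio_list_init(struct bio_list *bl) { bl->head = bl->tail = NULL; }

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->bi_next = NULL;
	if (bl->tail)
		bl->tail->bi_next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->bi_next;
		if (!bl->head)
			bl->tail = NULL;
		bio->bi_next = NULL;
	}
	return bio;
}

/*
 * Hand every bio to the rescuer of its bio_set, except bios without a
 * bio_set and bios from fs_bio_set, which stay queued in their original
 * order. Returns the number of bios left on the list.
 */
int flush_bio_list(struct bio_list *queued)
{
	struct bio_list list = *queued;
	struct bio *bio;
	int kept = 0;

	bio_list_init(queued);
	while ((bio = bio_list_pop(&list))) {
		struct bio_set *bs = bio->bi_pool;

		if (!bs || bs == fs_bio_set) {
			bio_list_add(queued, bio);
			kept++;
			continue;
		}
		bs->rescued++;	/* stands in for rescue_list + queue_work() */
	}
	return kept;
}

int flush_filter_demo(void)
{
	struct bio_set dm_set = { 0 };	/* a stacking driver's private bio_set */
	struct bio a = { NULL, &dm_set }, b = { NULL, fs_bio_set };
	struct bio c = { NULL, NULL }, d = { NULL, &dm_set };
	struct bio_list queued;
	int kept;

	bio_list_init(&queued);
	bio_list_add(&queued, &a);
	bio_list_add(&queued, &b);
	bio_list_add(&queued, &c);
	bio_list_add(&queued, &d);

	kept = flush_bio_list(&queued);
	assert(dm_set.rescued == 2);		/* a and d were offloaded */
	assert(fs_bio_set->rescued == 0);	/* b was never offloaded */
	assert(queued.head == &b && b.bi_next == &c && queued.tail == &c);
	return kept;
}
```

Bios from the stacking driver's own bio_set go to its rescuer, while the
fs_bio_set bio and the bio without a bio_set remain on the list for the
submitting task to process itself.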


* Re: [PATCH 0/3] offload bios to a thread
  2016-07-06 13:36 ` [PATCH 0/3] offload bios to a thread Mike Snitzer
@ 2016-07-06 13:53   ` Mikulas Patocka
  2016-07-06 13:55     ` Mike Snitzer
  0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-06 13:53 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac



On Wed, 6 Jul 2016, Mike Snitzer wrote:

> On Mon, Jul 04 2016 at  6:53pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > This is the second version of patches that fix deadlocks by redirecting 
> > bios from current->bio_list to rescuer workqueues.
> > 
> > I found out that the original patches caused deadlock with the loopback 
> > device. When the loopback device is used, both lower and upper filesystems 
> > use the same bio set - fs_bio_set. Consequently, bios submitted by both of 
> > them end up on the same rescuer workqueue. There is a deadlock possibility 
> > - if generic_make_request for the upper filesystem's bio blocks (because 
> > there are too many requests in flight on the loop device), it may stall 
> > processing some bios for the lower filesystem.
> > 
> > Ideally, each filesystem should have its own bio set. But it doesn't. So
> > I fix this problem by not offloading bios allocated from fs_bio_set.
> 
> I'd have much preferred that you just send an incremental fix built on
> the tree you know I started, here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

You need to change three patches in your git:
* block: flush queued bios when process blocks to avoid deadlock
* block: prepare for timed offload of queued bios to workqueue
* block: use timed offload of queued bios to a workqueue
because this bug is present in all of them.

When these patches are sent to Linus, the bug should not be present in any 
of them.

Mikulas

> I've now folded your fix into this tree.
> 
> But please don't ignore work that you know was done to further prepare
> your patches for inclusion.  It makes for tedious busywork on my end to
> pull out the incremental fix, which is simply:
> 
> diff --git a/block/bio.c b/block/bio.c
> index 7c49b91..80ebe88 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -357,7 +357,9 @@ static void bio_alloc_rescue(struct work_struct *work)
>   * to their rescue workqueue.
>   *
>   * If the bio doesn't have a bio_set, we leave it on queued_bios->bio_list.
> - * However, stacking drivers should use bio_set, so this shouldn't be
> + * If the bio is allocated from fs_bio_set, we must leave it to avoid
> + * deadlock on loopback block device.
> + * But stacking drivers should use a bio_set, so this shouldn't be
>   * an issue.
>   */
>  static void blk_timer_flush_bio_list(unsigned long data)
> @@ -371,7 +373,7 @@ static void blk_timer_flush_bio_list(unsigned long data)
>  	while ((bio = bio_list_pop(&list))) {
>  		unsigned long flags;
>  		struct bio_set *bs = bio->bi_pool;
> -		if (unlikely(!bs)) {
> +		if (unlikely(!bs) || bs == fs_bio_set) {
>  			bio_list_add(&queued_bios->bio_list, bio);
>  			continue;
>  		}
> 


* Re: [PATCH 0/3] offload bios to a thread
  2016-07-06 13:53   ` Mikulas Patocka
@ 2016-07-06 13:55     ` Mike Snitzer
  2016-07-06 15:23       ` Mikulas Patocka
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Snitzer @ 2016-07-06 13:55 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac

On Wed, Jul 06 2016 at  9:53am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 6 Jul 2016, Mike Snitzer wrote:
> 
> > On Mon, Jul 04 2016 at  6:53pm -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > Hi
> > > 
> > > This is the second version of patches that fix deadlocks by redirecting 
> > > bios from current->bio_list to rescuer workqueues.
> > > 
> > > I found out that the original patches caused deadlock with the loopback 
> > > device. When the loopback device is used, both lower and upper filesystems 
> > > use the same bio set - fs_bio_set. Consequently, bios submitted by both of 
> > > them end up on the same rescuer workqueue. There is a deadlock possibility 
> > > - if generic_make_request for the upper filesystem's bio blocks (because 
> > > there are too many requests in flight on the loop device), it may stall 
> > > processing some bios for the lower filesystem.
> > > 
> > > Ideally, each filesystem should have its own bio set. But it doesn't. So
> > > I fix this problem by not offloading bios allocated from fs_bio_set.
> > 
> > I'd have much preferred that you just send an incremental fix built on
> > the tree you know I started, here:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> 
> You need to change three patches in your git:
> * block: flush queued bios when process blocks to avoid deadlock
> * block: prepare for timed offload of queued bios to workqueue
> * block: use timed offload of queued bios to a workqueue
> because this bug is present in all of them.
> 
> When these patches are sent to Linus, the bug should not be present in any 
> of them.

Yes, I'm aware.  Please review:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip


* Re: [PATCH 0/3] offload bios to a thread
  2016-07-06 13:55     ` Mike Snitzer
@ 2016-07-06 15:23       ` Mikulas Patocka
  0 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-06 15:23 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac



On Wed, 6 Jul 2016, Mike Snitzer wrote:

> > > I'd have much preferred that you just send an incremental fix built on
> > > the tree you know I started, here:
> > > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> > 
> > You need to change three patches in your git:
> > * block: flush queued bios when process blocks to avoid deadlock
> > * block: prepare for timed offload of queued bios to workqueue
> > * block: use timed offload of queued bios to a workqueue
> > because this bug is present in all of them.
> > 
> > When these patches are sent to Linus, the bug should not be present in any 
> > of them.
> 
> Yes, I'm aware.  Please review:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

Yes, it's OK.

Mikulas

