* [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock
2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
@ 2016-07-04 22:56 ` Mikulas Patocka
2016-07-04 22:58 ` [PATCH 2/3] block: prepare for timed bio offload Mikulas Patocka
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:56 UTC
To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel
The block layer uses a per-process bio list to avoid recursion in
generic_make_request. When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately. The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.
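The queuing scheme described above can be modelled in userspace. This is only a minimal sketch, not the kernel code: struct req, active_list, handle() and submit() are hypothetical stand-ins for struct bio, current->bio_list, a stacking driver's make_request function and generic_make_request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio: a request aimed at one stacking level. */
struct req {
	int level;		/* 0 = bottom device, >0 = stacking driver */
	struct req *next;
};

struct req_list {
	struct req *head, *tail;
};

static struct req_list *active_list;	/* models current->bio_list */
static int cur_depth, max_depth, handled;

static void req_list_add(struct req_list *l, struct req *r)
{
	r->next = NULL;
	if (l->tail)
		l->tail->next = r;
	else
		l->head = r;
	l->tail = r;
}

static struct req *req_list_pop(struct req_list *l)
{
	struct req *r = l->head;

	if (r) {
		l->head = r->next;
		if (!l->head)
			l->tail = NULL;
		r->next = NULL;
	}
	return r;
}

static void submit(struct req *r);

/* Models a stacking driver: remap to the next lower level and resubmit. */
static void handle(struct req *r)
{
	cur_depth++;
	if (cur_depth > max_depth)
		max_depth = cur_depth;
	handled++;
	if (r->level > 0) {
		r->level--;
		submit(r);	/* nested call: only queues, never recurses */
	}
	cur_depth--;
}

/* Models generic_make_request: nested calls queue, the top level drains. */
static void submit(struct req *r)
{
	struct req_list on_stack = { NULL, NULL };

	if (active_list) {
		req_list_add(active_list, r);
		return;
	}
	assert(r->next == NULL);
	active_list = &on_stack;
	do {
		handle(r);
		r = req_list_pop(active_list);
	} while (r);
	active_list = NULL;	/* deactivate */
}
```

Submitting a request with level 5 runs handle() six times, yet handle() is never nested more than one deep. The cost of this scheme is the one the rest of this description is about: a queued request is not issued until the top-level caller gets back to its loop.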
Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory. However, another deadlock (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).
Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call. Consequently, when
the process blocks on a mutex, the bios queued on current->bio_list are
dispatched to independent workqueues and they can complete without
waiting for the mutex to be available.
Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
calls to it because bio_alloc_bioset() will implicitly punt all bios on
current->bio_list if it performs a blocking allocation.
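As a rough userspace illustration of the flush step (a sketch under simplifying assumptions, not the patch itself): struct item, struct pool, shared_fs_pool and flush_pending() are hypothetical names modelling struct bio, struct bio_set, fs_bio_set and blk_flush_bio_list. Locking is omitted because the model is single-threaded, and the kicks counter merely stands in for queue_work().

```c
#include <assert.h>
#include <stddef.h>

struct pool;				/* models struct bio_set */

struct item {				/* models struct bio */
	struct pool *pool;		/* models bio->bi_pool, NULL if none */
	struct item *next;
};

struct item_list {
	struct item *head, *tail;
};

struct pool {
	struct item_list rescue_list;	/* models bs->rescue_list */
	int kicks;			/* counts would-be queue_work() calls */
};

static struct pool shared_fs_pool;	/* models fs_bio_set */

static void item_list_add(struct item_list *l, struct item *it)
{
	it->next = NULL;
	if (l->tail)
		l->tail->next = it;
	else
		l->head = it;
	l->tail = it;
}

static struct item *item_list_pop(struct item_list *l)
{
	struct item *it = l->head;

	if (it) {
		l->head = it->next;
		if (!l->head)
			l->tail = NULL;
		it->next = NULL;
	}
	return it;
}

/*
 * Move every pending item to the rescue list of its own pool.
 * Items without a pool, or from the shared fs pool, stay pending,
 * in their original relative order.
 */
static void flush_pending(struct item_list *pending)
{
	struct item_list list = *pending;
	struct item *it;

	assert(pending != NULL);
	pending->head = pending->tail = NULL;

	while ((it = item_list_pop(&list))) {
		struct pool *p = it->pool;

		if (!p || p == &shared_fs_pool) {
			item_list_add(pending, it);
			continue;
		}
		item_list_add(&p->rescue_list, it);
		p->kicks++;
	}
}
```

Taking a private copy of the list and reinitialising the original first means that items which must stay behind can simply be appended back, which keeps their order without removing anything from the middle of a singly linked list.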
** Here is the dm-snapshot deadlock that was observed:
1) Process A sends a one-page read bio to the dm-snapshot target. The bio
spans a snapshot chunk boundary, so device mapper splits it into two
bios.
2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.
3) The function snapshot_map calls track_chunk (that allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
4) The remapped bio is submitted with generic_make_request, but it isn't
issued - it is added to current->bio_list instead.
5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
chunk affected by the first remapped bio. It takes down_write(&s->lock)
and then loops in __check_for_conflicting_io, waiting for the
dm_snap_tracked_chunk created in step 3) to be released.
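6) Process A continues and creates a second sub-bio for the rest of the
original bio.
7) snapshot_map is called for this new bio and waits on
down_write(&s->lock), which is held by process B (step 5). Process B in
turn waits for the first sub-bio, which is still sitting unissued on
process A's current->bio_list, so neither process can make progress.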
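The steps above form a cycle of waiters. A small model can make that explicit; this is only an illustration, and the node names, waits_on[] table and helper functions are hypothetical, not kernel code. Each entry records what a participant waits on, and the offload flag models whether the first sub-bio has been handed to an independent rescue workqueue.

```c
#include <assert.h>

/* Participants in the dm-snapshot scenario (hypothetical model). */
enum { PROC_A, PROC_B, CHUNK, BIO1, NODES };
#define NOTHING (-1)

/*
 * Follow the wait chain from @start. If it reaches a participant that
 * waits on nothing, progress is possible and 0 is returned. If no such
 * participant is found within NODES + 1 hops, the chain must loop, so
 * 1 is returned.
 */
static int is_stuck(const int waits_on[NODES], int start)
{
	int node = start;
	int steps;

	for (steps = 0; steps <= NODES; steps++) {
		if (waits_on[node] == NOTHING)
			return 0;
		node = waits_on[node];
	}
	return 1;
}

static void build_scenario(int waits_on[NODES], int offload)
{
	waits_on[PROC_A] = PROC_B;	/* step 7: A blocks on s->lock held by B */
	waits_on[PROC_B] = CHUNK;	/* step 5: B waits for the tracked chunk */
	waits_on[CHUNK] = BIO1;		/* chunk is released when bio 1 completes */
	/*
	 * Step 4: bio 1 sits on A's bio_list and is issued only when A
	 * returns, unless it was offloaded to a rescue workqueue.
	 */
	waits_on[BIO1] = offload ? NOTHING : PROC_A;
}
```

Without offload the chain A, B, chunk, bio 1 leads back to A. With offload the chain ends at bio 1, which can complete on its own and let B drop the lock.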
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Cc: stable@vger.kernel.org
---
block/bio.c | 77 +++++++++++++++++++------------------------------
include/linux/blkdev.h | 24 ++++++++++-----
kernel/sched/core.c | 7 +---
3 files changed, 50 insertions(+), 58 deletions(-)
Index: linux-4.7-rc6/block/bio.c
===================================================================
--- linux-4.7-rc6.orig/block/bio.c 2016-07-04 23:00:17.000000000 +0200
+++ linux-4.7-rc6/block/bio.c 2016-07-05 00:02:01.000000000 +0200
@@ -349,35 +349,37 @@ static void bio_alloc_rescue(struct work
}
}
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * Pop bios queued on @tsk->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * If the bio is allocated from fs_bio_set, we must leave it to avoid
+ * deadlock on loopback block device.
+ * Stacking bio drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
{
- struct bio_list punt, nopunt;
struct bio *bio;
+ struct bio_list list = *tsk->bio_list;
+ bio_list_init(tsk->bio_list);
- /*
- * In order to guarantee forward progress we must punt only bios that
- * were allocated from this bio_set; otherwise, if there was a bio on
- * there for a stacking driver higher up in the stack, processing it
- * could require allocating bios from this bio_set, and doing that from
- * our own rescuer would be bad.
- *
- * Since bio lists are singly linked, pop them all instead of trying to
- * remove from the middle of the list:
- */
-
- bio_list_init(&punt);
- bio_list_init(&nopunt);
-
- while ((bio = bio_list_pop(current->bio_list)))
- bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
- *current->bio_list = nopunt;
-
- spin_lock(&bs->rescue_lock);
- bio_list_merge(&bs->rescue_list, &punt);
- spin_unlock(&bs->rescue_lock);
+ while ((bio = bio_list_pop(&list))) {
+ struct bio_set *bs = bio->bi_pool;
+ if (unlikely(!bs) || bs == fs_bio_set) {
+ bio_list_add(tsk->bio_list, bio);
+ continue;
+ }
- queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ spin_lock(&bs->rescue_lock);
+ bio_list_add(&bs->rescue_list, bio);
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ spin_unlock(&bs->rescue_lock);
+ }
}
/**
@@ -417,7 +419,6 @@ static void punt_bios_to_rescuer(struct
*/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
- gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -452,23 +453,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
* reserve.
*
* We solve this, and guarantee forward progress, with a rescuer
- * workqueue per bio_set. If we go to allocate and there are
- * bios on current->bio_list, we first try the allocation
- * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
- * bios we would be blocking to the rescuer workqueue before
- * we retry with the original gfp_flags.
+ * workqueue per bio_set. If an allocation would block (due to
+ * __GFP_DIRECT_RECLAIM) the scheduler will first punt all bios
+ * on current->bio_list to the rescuer workqueue.
*/
-
- if (current->bio_list && !bio_list_empty(current->bio_list))
- gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-
p = mempool_alloc(bs->bio_pool, gfp_mask);
- if (!p && gfp_mask != saved_gfp) {
- punt_bios_to_rescuer(bs);
- gfp_mask = saved_gfp;
- p = mempool_alloc(bs->bio_pool, gfp_mask);
- }
-
front_pad = bs->front_pad;
inline_vecs = BIO_INLINE_VECS;
}
@@ -481,12 +470,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
if (nr_iovecs > inline_vecs) {
bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
- if (!bvl && gfp_mask != saved_gfp) {
- punt_bios_to_rescuer(bs);
- gfp_mask = saved_gfp;
- bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
- }
-
if (unlikely(!bvl))
goto err_free;
Index: linux-4.7-rc6/include/linux/blkdev.h
===================================================================
--- linux-4.7-rc6.orig/include/linux/blkdev.h 2016-07-04 23:00:17.000000000 +0200
+++ linux-4.7-rc6/include/linux/blkdev.h 2016-07-04 23:58:02.000000000 +0200
@@ -1114,6 +1114,22 @@ static inline bool blk_needs_flush_plug(
!list_empty(&plug->cb_list));
}
+extern void blk_flush_bio_list(struct task_struct *tsk);
+
+static inline void blk_flush_queued_io(struct task_struct *tsk)
+{
+ /*
+ * Flush any queued bios to corresponding rescue threads.
+ */
+ if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+ blk_flush_bio_list(tsk);
+ /*
+ * Flush any plugged IO that is queued.
+ */
+ if (blk_needs_flush_plug(tsk))
+ blk_schedule_flush_plug(tsk);
+}
+
/*
* tag stuff
*/
@@ -1722,16 +1738,10 @@ static inline void blk_flush_plug(struct
{
}
-static inline void blk_schedule_flush_plug(struct task_struct *task)
+static inline void blk_flush_queued_io(struct task_struct *tsk)
{
}
-
-static inline bool blk_needs_flush_plug(struct task_struct *tsk)
-{
- return false;
-}
-
static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
sector_t *error_sector)
{
Index: linux-4.7-rc6/kernel/sched/core.c
===================================================================
--- linux-4.7-rc6.orig/kernel/sched/core.c 2016-07-04 23:00:17.000000000 +0200
+++ linux-4.7-rc6/kernel/sched/core.c 2016-07-04 23:01:29.000000000 +0200
@@ -3359,11 +3359,10 @@ static inline void sched_submit_work(str
if (!tsk->state || tsk_is_pi_blocked(tsk))
return;
/*
- * If we are going to sleep and we have plugged IO queued,
+ * If we are going to sleep and we have queued IO,
* make sure to submit it to avoid deadlocks.
*/
- if (blk_needs_flush_plug(tsk))
- blk_schedule_flush_plug(tsk);
+ blk_flush_queued_io(tsk);
}
asmlinkage __visible void __sched schedule(void)
@@ -4977,7 +4976,7 @@ long __sched io_schedule_timeout(long ti
long ret;
current->in_iowait = 1;
- blk_schedule_flush_plug(current);
+ blk_flush_queued_io(current);
delayacct_blkio_start();
rq = raw_rq();
* [PATCH 2/3] block: prepare for timed bio offload
2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
2016-07-04 22:56 ` [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock Mikulas Patocka
@ 2016-07-04 22:58 ` Mikulas Patocka
2016-07-04 22:59 ` [PATCH 3/3] block: use timed offload Mikulas Patocka
2016-07-06 13:36 ` [PATCH 0/3] offload bios to a thread Mike Snitzer
3 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:58 UTC
To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel
Replace the pointer current->bio_list with current->queued_bios, a pointer
to a new structure queued_bios that wraps the bio list. This is a
prerequisite for the following patch, which uses the timer placed in this
structure.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
block/bio.c | 6 +++---
block/blk-core.c | 16 ++++++++--------
drivers/md/bcache/btree.c | 12 ++++++------
drivers/md/dm-bufio.c | 2 +-
drivers/md/raid1.c | 6 +++---
drivers/md/raid10.c | 6 +++---
include/linux/blkdev.h | 7 ++++++-
include/linux/sched.h | 4 ++--
8 files changed, 32 insertions(+), 27 deletions(-)
Index: linux-4.7-rc6/include/linux/sched.h
===================================================================
--- linux-4.7-rc6.orig/include/linux/sched.h 2016-07-04 23:58:01.000000000 +0200
+++ linux-4.7-rc6/include/linux/sched.h 2016-07-05 00:02:10.000000000 +0200
@@ -128,7 +128,7 @@ struct sched_attr {
struct futex_pi_state;
struct robust_list_head;
-struct bio_list;
+struct queued_bios;
struct fs_struct;
struct perf_event_context;
struct blk_plug;
@@ -1727,7 +1727,7 @@ struct task_struct {
void *journal_info;
/* stacked block device info */
- struct bio_list *bio_list;
+ struct queued_bios *queued_bios;
#ifdef CONFIG_BLOCK
/* stack plugging */
Index: linux-4.7-rc6/block/blk-core.c
===================================================================
--- linux-4.7-rc6.orig/block/blk-core.c 2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/block/blk-core.c 2016-07-05 00:02:10.000000000 +0200
@@ -2031,7 +2031,7 @@ end_io:
*/
blk_qc_t generic_make_request(struct bio *bio)
{
- struct bio_list bio_list_on_stack;
+ struct queued_bios queued_bios_on_stack;
blk_qc_t ret = BLK_QC_T_NONE;
if (!generic_make_request_checks(bio))
@@ -2047,8 +2047,8 @@ blk_qc_t generic_make_request(struct bio
* it is non-NULL, then a make_request is active, and new requests
* should be added at the tail
*/
- if (current->bio_list) {
- bio_list_add(current->bio_list, bio);
+ if (current->queued_bios) {
+ bio_list_add(&current->queued_bios->bio_list, bio);
goto out;
}
@@ -2067,8 +2067,8 @@ blk_qc_t generic_make_request(struct bio
* bio_list, and call into ->make_request() again.
*/
BUG_ON(bio->bi_next);
- bio_list_init(&bio_list_on_stack);
- current->bio_list = &bio_list_on_stack;
+ bio_list_init(&queued_bios_on_stack.bio_list);
+ current->queued_bios = &queued_bios_on_stack;
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
@@ -2077,15 +2077,15 @@ blk_qc_t generic_make_request(struct bio
blk_queue_exit(q);
- bio = bio_list_pop(current->bio_list);
+ bio = bio_list_pop(&current->queued_bios->bio_list);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
+ struct bio *bio_next = bio_list_pop(&current->queued_bios->bio_list);
bio_io_error(bio);
bio = bio_next;
}
} while (bio);
- current->bio_list = NULL; /* deactivate */
+ current->queued_bios = NULL; /* deactivate */
out:
return ret;
Index: linux-4.7-rc6/include/linux/blkdev.h
===================================================================
--- linux-4.7-rc6.orig/include/linux/blkdev.h 2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/include/linux/blkdev.h 2016-07-05 00:02:10.000000000 +0200
@@ -1114,6 +1114,11 @@ static inline bool blk_needs_flush_plug(
!list_empty(&plug->cb_list));
}
+struct queued_bios {
+ struct bio_list bio_list;
+ struct timer_list timer;
+};
+
extern void blk_flush_bio_list(struct task_struct *tsk);
static inline void blk_flush_queued_io(struct task_struct *tsk)
@@ -1121,7 +1126,7 @@ static inline void blk_flush_queued_io(s
/*
* Flush any queued bios to corresponding rescue threads.
*/
- if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+ if (tsk->queued_bios && !bio_list_empty(&tsk->queued_bios->bio_list))
blk_flush_bio_list(tsk);
/*
* Flush any plugged IO that is queued.
Index: linux-4.7-rc6/block/bio.c
===================================================================
--- linux-4.7-rc6.orig/block/bio.c 2016-07-05 00:02:01.000000000 +0200
+++ linux-4.7-rc6/block/bio.c 2016-07-05 00:02:10.000000000 +0200
@@ -365,13 +365,13 @@ static void bio_alloc_rescue(struct work
void blk_flush_bio_list(struct task_struct *tsk)
{
struct bio *bio;
- struct bio_list list = *tsk->bio_list;
- bio_list_init(tsk->bio_list);
+ struct bio_list list = tsk->queued_bios->bio_list;
+ bio_list_init(&tsk->queued_bios->bio_list);
while ((bio = bio_list_pop(&list))) {
struct bio_set *bs = bio->bi_pool;
if (unlikely(!bs) || bs == fs_bio_set) {
- bio_list_add(tsk->bio_list, bio);
+ bio_list_add(&tsk->queued_bios->bio_list, bio);
continue;
}
Index: linux-4.7-rc6/drivers/md/bcache/btree.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/bcache/btree.c 2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/bcache/btree.c 2016-07-05 00:02:10.000000000 +0200
@@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree
trace_bcache_btree_write(b);
- BUG_ON(current->bio_list);
+ BUG_ON(current->queued_bios);
BUG_ON(b->written >= btree_blocks(b));
BUG_ON(b->written && !i->keys);
BUG_ON(btree_bset_first(b)->seq != i->seq);
@@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct
/* Force write if set is too big */
if (set_bytes(i) > PAGE_SIZE - 48 &&
- !current->bio_list)
+ !current->queued_bios)
bch_btree_node_write(b, NULL);
}
@@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct ca
{
struct btree *b;
- BUG_ON(current->bio_list);
+ BUG_ON(current->queued_bios);
lockdep_assert_held(&c->bucket_lock);
@@ -976,7 +976,7 @@ retry:
b = mca_find(c, k);
if (!b) {
- if (current->bio_list)
+ if (current->queued_bios)
return ERR_PTR(-EAGAIN);
mutex_lock(&c->bucket_lock);
@@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct
return 0;
split:
- if (current->bio_list) {
+ if (current->queued_bios) {
op->lock = b->c->root->level + 1;
return -EAGAIN;
} else if (op->lock <= b->c->root->level) {
@@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c
struct btree_insert_op op;
int ret = 0;
- BUG_ON(current->bio_list);
+ BUG_ON(current->queued_bios);
BUG_ON(bch_keylist_empty(keys));
bch_btree_op_init(&op.op, 0);
Index: linux-4.7-rc6/drivers/md/dm-bufio.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/dm-bufio.c 2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/dm-bufio.c 2016-07-05 00:02:10.000000000 +0200
@@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(s
#define DM_BUFIO_CACHE(c) (dm_bufio_caches[dm_bufio_cache_index(c)])
#define DM_BUFIO_CACHE_NAME(c) (dm_bufio_cache_names[dm_bufio_cache_index(c)])
-#define dm_bufio_in_request() (!!current->bio_list)
+#define dm_bufio_in_request() (!!current->queued_bios)
static void dm_bufio_lock(struct dm_bufio_client *c)
{
Index: linux-4.7-rc6/drivers/md/raid1.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/raid1.c 2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/raid1.c 2016-07-05 00:02:10.000000000 +0200
@@ -876,8 +876,8 @@ static sector_t wait_barrier(struct r1co
(!conf->barrier ||
((conf->start_next_window <
conf->next_resync + RESYNC_SECTORS) &&
- current->bio_list &&
- !bio_list_empty(current->bio_list))),
+ current->queued_bios &&
+ !bio_list_empty(&current->queued_bios->bio_list))),
conf->resync_lock);
conf->nr_waiting--;
}
@@ -1014,7 +1014,7 @@ static void raid1_unplug(struct blk_plug
struct r1conf *conf = mddev->private;
struct bio *bio;
- if (from_schedule || current->bio_list) {
+ if (from_schedule || current->queued_bios) {
spin_lock_irq(&conf->device_lock);
bio_list_merge(&conf->pending_bio_list, &plug->pending);
conf->pending_count += plug->pending_cnt;
Index: linux-4.7-rc6/drivers/md/raid10.c
===================================================================
--- linux-4.7-rc6.orig/drivers/md/raid10.c 2016-07-04 23:58:02.000000000 +0200
+++ linux-4.7-rc6/drivers/md/raid10.c 2016-07-05 00:02:10.000000000 +0200
@@ -945,8 +945,8 @@ static void wait_barrier(struct r10conf
wait_event_lock_irq(conf->wait_barrier,
!conf->barrier ||
(conf->nr_pending &&
- current->bio_list &&
- !bio_list_empty(current->bio_list)),
+ current->queued_bios &&
+ !bio_list_empty(&current->queued_bios->bio_list)),
conf->resync_lock);
conf->nr_waiting--;
}
@@ -1022,7 +1022,7 @@ static void raid10_unplug(struct blk_plu
struct r10conf *conf = mddev->private;
struct bio *bio;
- if (from_schedule || current->bio_list) {
+ if (from_schedule || current->queued_bios) {
spin_lock_irq(&conf->device_lock);
bio_list_merge(&conf->pending_bio_list, &plug->pending);
conf->pending_count += plug->pending_cnt;
* [PATCH 3/3] block: use timed offload
2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
2016-07-04 22:56 ` [PATCH 1/3] block: flush queued bios when process blocks to avoid deadlock Mikulas Patocka
2016-07-04 22:58 ` [PATCH 2/3] block: prepare for timed bio offload Mikulas Patocka
@ 2016-07-04 22:59 ` Mikulas Patocka
2016-07-06 13:36 ` [PATCH 0/3] offload bios to a thread Mike Snitzer
3 siblings, 0 replies; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-04 22:59 UTC
To: Alasdair G. Kergon, Mike Snitzer, Zdenek Kabelac; +Cc: dm-devel
This patch introduces a timed bio offload.
When a process schedules and there are bios queued on
current->queued_bios, we arm a timer that redirects the queued bios to
a workqueue after a specific timeout (currently 1s).
The reason for the timer is that immediate bio offload could change the
ordering of bios, which could theoretically cause performance regressions.
So we offload bios only if the process stays blocked for a certain amount
of time.
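The timing rule can be sketched with a logical clock instead of a real kernel timer. This is a simplified model under stated assumptions: struct pending, RESCUE_TIMEOUT, on_block(), on_resume() and on_tick() are hypothetical names, ticks stand in for jiffies, and the timeout value is arbitrary (the patch uses HZ).

```c
#include <assert.h>

#define RESCUE_TIMEOUT 100UL	/* hypothetical tick count; the patch uses HZ */

struct pending {
	int queued;		/* items still on the per-task list */
	int offloaded;		/* items handed to a rescue worker */
	int armed;		/* is the timer pending? */
	unsigned long deadline;	/* tick at which the timer fires */
};

/* The task is about to sleep: arm the timer if there is anything queued. */
static void on_block(struct pending *p, unsigned long now)
{
	assert(p != 0);
	if (p->queued && !p->armed) {
		p->armed = 1;
		p->deadline = now + RESCUE_TIMEOUT;
	}
}

/* The task runs again and will touch the list: cancel the timer first. */
static void on_resume(struct pending *p)
{
	p->armed = 0;
}

/* Clock tick: offload only if the task stayed blocked past the deadline. */
static void on_tick(struct pending *p, unsigned long now)
{
	/* Signed difference keeps the comparison correct across wraparound. */
	if (p->armed && (long)(now - p->deadline) >= 0) {
		p->offloaded += p->queued;
		p->queued = 0;
		p->armed = 0;
	}
}
```

A short sleep followed by on_resume() leaves the queue and its ordering untouched. Only a task that stays blocked for the whole timeout has its items offloaded.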
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
block/bio.c | 45 +++++++++++++++++++++++++++++++++------------
block/blk-core.c | 19 +++++++++++++++++--
2 files changed, 50 insertions(+), 14 deletions(-)
Index: linux-4.7-rc5-devel/block/bio.c
===================================================================
--- linux-4.7-rc5-devel.orig/block/bio.c 2016-06-30 17:25:56.000000000 +0200
+++ linux-4.7-rc5-devel/block/bio.c 2016-06-30 17:26:04.000000000 +0200
@@ -338,9 +338,10 @@ static void bio_alloc_rescue(struct work
struct bio *bio;
while (1) {
- spin_lock(&bs->rescue_lock);
+ unsigned long flags;
+ spin_lock_irqsave(&bs->rescue_lock, flags);
bio = bio_list_pop(&bs->rescue_list);
- spin_unlock(&bs->rescue_lock);
+ spin_unlock_irqrestore(&bs->rescue_lock, flags);
if (!bio)
break;
@@ -350,35 +351,55 @@ static void bio_alloc_rescue(struct work
}
/**
- * blk_flush_bio_list
- * @tsk: task_struct whose bio_list must be flushed
+ * blk_timer_flush_bio_list
*
- * Pop bios queued on @tsk->bio_list and submit each of them to
+ * Pop bios queued on q->bio_list and submit each of them to
* their rescue workqueue.
*
- * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * If the bio doesn't have a bio_set, we leave it on q->bio_list.
* If the bio is allocated from fs_bio_set, we must leave it to avoid
* deadlock on loopback block device.
* Stacking bio drivers should use bio_set, so this shouldn't be
* an issue.
*/
-void blk_flush_bio_list(struct task_struct *tsk)
+static void blk_timer_flush_bio_list(unsigned long data)
{
+ struct queued_bios *q = (struct queued_bios *)data;
struct bio *bio;
- struct bio_list list = tsk->queued_bios->bio_list;
- bio_list_init(&tsk->queued_bios->bio_list);
+
+ struct bio_list list = q->bio_list;
+ bio_list_init(&q->bio_list);
while ((bio = bio_list_pop(&list))) {
+ unsigned long flags;
struct bio_set *bs = bio->bi_pool;
if (unlikely(!bs) || bs == fs_bio_set) {
- bio_list_add(&tsk->queued_bios->bio_list, bio);
+ bio_list_add(&q->bio_list, bio);
continue;
}
- spin_lock(&bs->rescue_lock);
+ spin_lock_irqsave(&bs->rescue_lock, flags);
bio_list_add(&bs->rescue_list, bio);
queue_work(bs->rescue_workqueue, &bs->rescue_work);
- spin_unlock(&bs->rescue_lock);
+ spin_unlock_irqrestore(&bs->rescue_lock, flags);
+ }
+}
+
+#define BIO_RESCUE_TIMEOUT HZ
+
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * This function sets up a timer that flushes the queued bios.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
+{
+ struct queued_bios *q = tsk->queued_bios;
+ if (q->timer.function == NULL) {
+ setup_timer(&q->timer, blk_timer_flush_bio_list,
+ (unsigned long)q);
+ mod_timer(&q->timer, jiffies + BIO_RESCUE_TIMEOUT);
}
}
Index: linux-4.7-rc5-devel/block/blk-core.c
===================================================================
--- linux-4.7-rc5-devel.orig/block/blk-core.c 2016-06-30 17:25:56.000000000 +0200
+++ linux-4.7-rc5-devel/block/blk-core.c 2016-06-30 17:26:04.000000000 +0200
@@ -2032,6 +2032,7 @@ end_io:
blk_qc_t generic_make_request(struct bio *bio)
{
struct queued_bios queued_bios_on_stack;
+ struct queued_bios *q;
blk_qc_t ret = BLK_QC_T_NONE;
if (!generic_make_request_checks(bio))
@@ -2047,8 +2048,17 @@ blk_qc_t generic_make_request(struct bio
* it is non-NULL, then a make_request is active, and new requests
* should be added at the tail
*/
- if (current->queued_bios) {
- bio_list_add(&current->queued_bios->bio_list, bio);
+ q = current->queued_bios;
+ if (q) {
+ /*
+ * The timer may modify q->bio_list. So we must stop the timer
+ * before modifying the list.
+ */
+ if (q->timer.function != NULL) {
+ del_timer_sync(&q->timer);
+ q->timer.function = NULL;
+ }
+ bio_list_add(&q->bio_list, bio);
goto out;
}
@@ -2068,6 +2078,7 @@ blk_qc_t generic_make_request(struct bio
*/
BUG_ON(bio->bi_next);
bio_list_init(&queued_bios_on_stack.bio_list);
+ queued_bios_on_stack.timer.function = NULL;
current->queued_bios = &queued_bios_on_stack;
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
@@ -2084,6 +2095,10 @@ blk_qc_t generic_make_request(struct bio
bio_io_error(bio);
bio = bio_next;
}
+ if (unlikely(queued_bios_on_stack.timer.function != NULL)) {
+ del_timer_sync(&queued_bios_on_stack.timer);
+ queued_bios_on_stack.timer.function = NULL;
+ }
} while (bio);
current->queued_bios = NULL; /* deactivate */
* Re: [PATCH 0/3] offload bios to a thread
2016-07-04 22:53 [PATCH 0/3] offload bios to a thread Mikulas Patocka
` (2 preceding siblings ...)
2016-07-04 22:59 ` [PATCH 3/3] block: use timed offload Mikulas Patocka
@ 2016-07-06 13:36 ` Mike Snitzer
2016-07-06 13:53 ` Mikulas Patocka
3 siblings, 1 reply; 19+ messages in thread
From: Mike Snitzer @ 2016-07-06 13:36 UTC
To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac
On Mon, Jul 04 2016 at 6:53pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:
> Hi
>
> This is the second version of patches that fix deadlocks by redirecting
> bios from current->bio_list to rescuer workqueues.
>
> I found out that the original patches caused deadlock with the loopback
> device. When the loopback device is used, both lower and upper filesystems
> use the same bio set - fs_bio_set. Consequently, bios submitted by both of
> them end up on the same rescuer workqueue. There is a deadlock possibility
> - if generic_make_request for the upper filesystem's bio blocks (because
> there are too many requests in flight on the loop device), it may stall
> processing some bios for the lower filesystem.
>
> Ideally, each filesystem should have its own bio set. But it doesn't. So
> I fix this problem by not offloading bios allocated from fs_bio_set.
I'd have much preferred that you just send an incremental fix that built on the
tree you know I started, here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
I've now folded your fix into this tree.
But please don't ignore work that you know was done to further prepare
your patches for inclusion. It makes for tedious busy work on my end to
pull out the incremental fix, which is simply:
diff --git a/block/bio.c b/block/bio.c
index 7c49b91..80ebe88 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -357,7 +357,9 @@ static void bio_alloc_rescue(struct work_struct *work)
* to their rescue workqueue.
*
* If the bio doesn't have a bio_set, we leave it on queued_bios->bio_list.
- * However, stacking drivers should use bio_set, so this shouldn't be
+ * If the bio is allocated from fs_bio_set, we must leave it to avoid
+ * deadlock on loopback block device.
+ * But stacking drivers should use a bio_set, so this shouldn't be
* an issue.
*/
static void blk_timer_flush_bio_list(unsigned long data)
@@ -371,7 +373,7 @@ static void blk_timer_flush_bio_list(unsigned long data)
while ((bio = bio_list_pop(&list))) {
unsigned long flags;
struct bio_set *bs = bio->bi_pool;
- if (unlikely(!bs)) {
+ if (unlikely(!bs) || bs == fs_bio_set) {
bio_list_add(&queued_bios->bio_list, bio);
continue;
}
* Re: [PATCH 0/3] offload bios to a thread
2016-07-06 13:36 ` [PATCH 0/3] offload bios to a thread Mike Snitzer
@ 2016-07-06 13:53 ` Mikulas Patocka
2016-07-06 13:55 ` Mike Snitzer
0 siblings, 1 reply; 19+ messages in thread
From: Mikulas Patocka @ 2016-07-06 13:53 UTC
To: Mike Snitzer; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac
On Wed, 6 Jul 2016, Mike Snitzer wrote:
> On Mon, Jul 04 2016 at 6:53pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> > Hi
> >
> > This is the second version of patches that fix deadlocks by redirecting
> > bios from current->bio_list to rescuer workqueues.
> >
> > I found out that the original patches caused deadlock with the loopback
> > device. When the loopback device is used, both lower and upper filesystems
> > use the same bio set - fs_bio_set. Consequently, bios submitted by both of
> > them end up on the same rescuer workqueue. There is a deadlock possibility
> > - if generic_make_request for the upper filesystem's bio blocks (because
> > there are too many requests in flight on the loop device), it may stall
> > processing some bios for the lower filesystem.
> >
> > Ideally, each filesystem should have its own bio set. But it doesn't. So
> > I fix this problem by not offloading bios allocated from fs_bio_set.
>
> I'd much preferred you just send an incremental fix that built on the
> tree you know I started, here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
You need to change three patches in your git:
* block: flush queued bios when process blocks to avoid deadlock
* block: prepare for timed offload of queued bios to workqueue
* block: use timed offload of queued bios to a workqueue
because this bug is present in all of them.
When these patches are sent to Linus, the bug should not be present in any
of them.
Mikulas
> I've now folded your fix into this tree.
>
> But please don't ignore work you know that was done to further prepare
> your patches for inclusion. It makes for tedious busy work on my end to
> pull out the incremental fix, which is simply:
>
> diff --git a/block/bio.c b/block/bio.c
> index 7c49b91..80ebe88 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -357,7 +357,9 @@ static void bio_alloc_rescue(struct work_struct *work)
> * to their rescue workqueue.
> *
> * If the bio doesn't have a bio_set, we leave it on queued_bios->bio_list.
> - * However, stacking drivers should use bio_set, so this shouldn't be
> + * If the bio is allocated from fs_bio_set, we must leave it to avoid
> + * deadlock on loopback block device.
> + * But stacking drivers should use a bio_set, so this shouldn't be
> * an issue.
> */
> static void blk_timer_flush_bio_list(unsigned long data)
> @@ -371,7 +373,7 @@ static void blk_timer_flush_bio_list(unsigned long data)
> while ((bio = bio_list_pop(&list))) {
> unsigned long flags;
> struct bio_set *bs = bio->bi_pool;
> - if (unlikely(!bs)) {
> + if (unlikely(!bs) || bs == fs_bio_set) {
> bio_list_add(&queued_bios->bio_list, bio);
> continue;
> }
>
* Re: [PATCH 0/3] offload bios to a thread
2016-07-06 13:53 ` Mikulas Patocka
@ 2016-07-06 13:55 ` Mike Snitzer
2016-07-06 15:23 ` Mikulas Patocka
0 siblings, 1 reply; 19+ messages in thread
From: Mike Snitzer @ 2016-07-06 13:55 UTC
To: Mikulas Patocka; +Cc: dm-devel, Alasdair G. Kergon, Zdenek Kabelac
On Wed, Jul 06 2016 at 9:53am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Wed, 6 Jul 2016, Mike Snitzer wrote:
>
> > On Mon, Jul 04 2016 at 6:53pm -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > > Hi
> > >
> > > This is the second version of patches that fix deadlocks by redirecting
> > > bios from current->bio_list to rescuer workqueues.
> > >
> > > I found out that the original patches caused deadlock with the loopback
> > > device. When the loopback device is used, both lower and upper filesystems
> > > use the same bio set - fs_bio_set. Consequently, bios submitted by both of
> > > them end up on the same rescuer workqueue. There is a deadlock possibility
> > > - if generic_make_request for the upper filesystem's bio blocks (because
> > > there are too many requests in flight on the loop device), it may stall
> > > processing some bios for the lower filesystem.
> > >
> > > Ideally, each filesystem should have its own bio set. But it doesn't. So
> > > I fix this problem by not offloading bios allocated from fs_bio_set.
> >
> > I'd much preferred you just send an incremental fix that built on the
> > tree you know I started, here:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
> You need to change three patches in your git:
> * block: flush queued bios when process blocks to avoid deadlock
> * block: prepare for timed offload of queued bios to workqueue
> * block: use timed offload of queued bios to a workqueue
> because this bug is present in all of them.
>
> When these patches are sent to Linus, the bug should not be present in any
> of them.
Yes, I'm aware. Please review:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip