From: Kent Overstreet <koverstreet@google.com>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>,
linux-bcache@vger.kernel.org, linux-kernel@vger.kernel.org,
dm-devel@redhat.com, tj@kernel.org, axboe@kernel.dk,
agk@redhat.com, neilb@suse.de, drbd-dev@lists.linbit.com,
vgoyal@redhat.com, sage@newdream.net, yehuda@hq.newdream.net
Subject: Re: [PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())
Date: Wed, 15 Aug 2012 16:07:15 -0700 [thread overview]
Message-ID: <20120815230715.GD2758@google.com> (raw)
In-Reply-To: <Pine.LNX.4.64.1207262005450.28322@file.rdu.redhat.com>
> Both md and dm use __GFP_WAIT allocations from mempools in
> generic_make_request.
>
> I think you found an interesting bug here. Suppose that we have three
> stacked devices: d1 depends on d2 and d2 depends on d3.
>
> Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and
> sends them to the device d2 - these bios end up in current->bio_list. The
> driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now,
> current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off
> the bio list and the driver for d2 is called with b2.2 - suppose that for
> some reason mempool in d2 is exhausted and the driver needs to wait until
> b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1
> is still in current->bio_list. So it deadlocks.
>
> Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible
> deadlock) into another bug (a possible bio failure with -ENOMEM).
>
> Increasing mempool sizes doesn't fix it either, the bio would just have to
> be split to more pieces in the above example to make it deadlock.
>
> I think the above possible deadlock scenario could be fixed by reversing
> current->bio_list processing - i.e. when some device's make_request_fn
> adds some bios to current->bio_list, these bios are processed before other
> bios that were on the list before. This way, bio list would contain "b3.1,
> b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would
> not happen.
Your patch isn't sufficient in the case where a bio may be split
multiple times (I'm not sure if it's sufficient in the case where bios
are only split once; trying to work it all out makes my head hurt).
You don't need multiple stacked drivers to see this; just the case where
a single driver is running that splits multiple times is sufficient, if
you have enough threads submitting at the same time.
Bcache works around this with the trick I mentioned previously, where it
masks out _GFP_WAIT if current->bio_list != NULL, and punts to workqueue
if the allocation fails.
This works but it'd have to be done in each stacking driver... it's not
a generic solution, and it's a pain in the ass.
I came up with another idea the other day. Conceptually, it inverts my
previous workaround - the punting to workqueue is done in the allocation
code when necessary, for the bios that would be blocked.
It's lightly tested, gonna rig up some kind of fault injection and test
it more thoroughly later.
commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8
Author: Kent Overstreet <koverstreet@google.com>
Date: Mon Aug 13 18:11:01 2012 -0700
block: Avoid deadlocks with bio allocation by stacking drivers
Previously, if we ever try to allocate more than once from the same bio
set while running under generic_make_request(), we risk deadlock.
This would happen if e.g. a bio ever needed to be split more than once,
and it's difficult to handle correctly in the drivers - so in practice
it's not.
This patch fixes this issue by allocating a rescuer workqueue for each
bio_set, and punting queued bios to said rescuer when necessary:
diff --git a/fs/bio.c b/fs/bio.c
index bc4265a..7b4f655 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -281,6 +281,23 @@ void bio_reset(struct bio *bio)
}
EXPORT_SYMBOL(bio_reset);
+static void bio_alloc_rescue(struct work_struct *work)
+{
+ struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+ struct bio *bio;
+
+ while (1) {
+ spin_lock(&bs->rescue_lock);
+ bio = bio_list_pop(&bs->rescue_list);
+ spin_unlock(&bs->rescue_lock);
+
+ if (!bio)
+ break;
+
+ generic_make_request(bio);
+ }
+}
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset);
**/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
p = kmalloc(sizeof(struct bio) +
nr_iovecs * sizeof(struct bio_vec),
gfp_mask);
+
front_pad = 0;
inline_vecs = nr_iovecs;
} else {
- p = mempool_alloc(bs->bio_pool, gfp_mask);
+ /*
+ * If we're running under generic_make_request()
+ * (current->bio_list != NULL), we risk deadlock if we sleep on
+ * allocation and there's already bios on current->bio_list that
+ * were allocated from the same bio_set; they won't be submitted
+ * (and thus freed) as long as we're blocked here.
+ *
+ * To deal with this, we first try the allocation without using
+ * the mempool; if that fails, we punt all the bios on
+ * current->bio_list to a different thread and then retry the
+ * allocation with the original gfp mask.
+ */
+
+ if (current->bio_list &&
+ !bio_list_empty(current->bio_list) &&
+ (gfp_mask & __GFP_WAIT))
+ gfp_mask &= GFP_ATOMIC;
+retry:
+ if (gfp_mask & __GFP_WAIT)
+ p = mempool_alloc(bs->bio_pool, gfp_mask);
+ else
+ p = kmem_cache_alloc(bs->bio_slab, gfp_mask);
+
front_pad = bs->front_pad;
inline_vecs = BIO_INLINE_VECS;
}
if (unlikely(!p))
- return NULL;
+ goto err;
bio = p + front_pad;
bio_init(bio);
@@ -338,6 +379,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
err_free:
mempool_free(p, bs->bio_pool);
+err:
+ if (gfp_mask != saved_gfp) {
+ gfp_mask = saved_gfp;
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_merge(&bs->rescue_list, current->bio_list);
+ bio_list_init(current->bio_list);
+ spin_unlock(&bs->rescue_lock);
+
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ goto retry;
+ }
+
return NULL;
}
EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1546,6 +1600,9 @@ static void biovec_free_pools(struct bio_set *bs)
void bioset_free(struct bio_set *bs)
{
+ if (bs->rescue_workqueue)
+ destroy_workqueue(bs->rescue_workqueue);
+
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);
@@ -1581,6 +1638,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
bs->front_pad = front_pad;
+ spin_lock_init(&bs->rescue_lock);
+ bio_list_init(&bs->rescue_list);
+ INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
if (!bs->bio_slab) {
kfree(bs);
@@ -1591,9 +1652,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
if (!bs->bio_pool)
goto bad;
- if (!biovec_create_pools(bs, pool_size))
- return bs;
+ if (biovec_create_pools(bs, pool_size))
+ goto bad;
+
+ bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+ if (!bs->rescue_workqueue)
+ goto bad;
+ return bs;
bad:
bioset_free(bs);
return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b22c22b..ba5b52e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */
-/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-#define BIO_POOL_SIZE 2
-#define BIOVEC_NR_POOLS 6
-#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
-
-struct bio_set {
- struct kmem_cache *bio_slab;
- unsigned int front_pad;
-
- mempool_t *bio_pool;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- mempool_t *bio_integrity_pool;
-#endif
- mempool_t *bvec_pool;
-};
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
-
#ifdef CONFIG_HIGHMEM
/*
* remember never ever reenable interrupts between a bvec_kmap_irq and
@@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
return bio;
}
+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
+
+struct bio_set {
+ struct kmem_cache *bio_slab;
+ unsigned int front_pad;
+
+ mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ mempool_t *bio_integrity_pool;
+#endif
+ mempool_t *bvec_pool;
+
+ /*
+ * Deadlock avoidance for stacking block drivers: see comments in
+ * bio_alloc_bioset() for details
+ */
+ spinlock_t rescue_lock;
+ struct bio_list rescue_list;
+ struct work_struct rescue_work;
+ struct workqueue_struct *rescue_workqueue;
+};
+
+struct biovec_slab {
+ int nr_vecs;
+ char *name;
+ struct kmem_cache *slab;
+};
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)
#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
next prev parent reply other threads:[~2012-08-15 23:07 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-24 20:11 [PATCH v4 00/12] Block cleanups Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 01/12] block: Generalized bio pool freeing Kent Overstreet
[not found] ` <1343160689-12378-2-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-25 11:06 ` Boaz Harrosh
[not found] ` <500FD32F.2010809-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>
2012-07-25 23:38 ` Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 02/12] dm: Use bioset's front_pad for dm_rq_clone_bio_info Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 03/12] block: Add bio_reset() Kent Overstreet
[not found] ` <1343160689-12378-4-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-25 11:19 ` Boaz Harrosh
[not found] ` <500FD63F.7050501-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>
2012-07-25 22:56 ` Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 04/12] pktcdvd: Switch to bio_kmalloc() Kent Overstreet
[not found] ` <1343160689-12378-5-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-25 11:29 ` Boaz Harrosh
[not found] ` <500FD8B7.9040701-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>
2012-07-25 23:01 ` Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 05/12] block: Kill bi_destructor Kent Overstreet
[not found] ` <1343160689-12378-6-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-25 11:39 ` Boaz Harrosh
[not found] ` <500FDB0D.4070605-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>
2012-07-25 23:15 ` Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 06/12] block: Add an explicit bio flag for bios that own their bvec Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 08/12] block: Introduce new bio_split() Kent Overstreet
2012-07-25 11:55 ` Boaz Harrosh
2012-07-25 23:26 ` Kent Overstreet
2012-07-27 0:50 ` [PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split()) Mikulas Patocka
2012-08-15 23:07 ` Kent Overstreet [this message]
[not found] ` <20120815230715.GD2758-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-08-29 16:08 ` Mikulas Patocka
[not found] ` <1343160689-12378-1-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-24 20:11 ` [PATCH v4 07/12] block: Rename bio_split() -> bio_pair_split() Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 09/12] block: Rework bio_pair_split() Kent Overstreet
[not found] ` <1343160689-12378-10-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-25 12:03 ` Boaz Harrosh
2012-07-24 20:11 ` [PATCH v4 10/12] block: Add bio_clone_kmalloc() Kent Overstreet
[not found] ` <1343160689-12378-11-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-07-25 12:05 ` Boaz Harrosh
2012-07-24 20:11 ` [PATCH v4 11/12] block: Add bio_clone_bioset() Kent Overstreet
2012-07-24 20:11 ` [PATCH v4 12/12] block: Only clone bio vecs that are in use Kent Overstreet
[not found] ` <1343160689-12378-13-git-send-email-koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-08-07 3:17 ` Muthu Kumar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120815230715.GD2758@google.com \
--to=koverstreet@google.com \
--cc=agk@redhat.com \
--cc=axboe@kernel.dk \
--cc=bharrosh@panasas.com \
--cc=dm-devel@redhat.com \
--cc=drbd-dev@lists.linbit.com \
--cc=linux-bcache@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mpatocka@redhat.com \
--cc=neilb@suse.de \
--cc=sage@newdream.net \
--cc=tj@kernel.org \
--cc=vgoyal@redhat.com \
--cc=yehuda@hq.newdream.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).