From: Mikulas Patocka <mpatocka-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Boaz Harrosh <bharrosh-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>,
linux-bcache-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org,
agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
neilb-l3A5Bk7waGM@public.gmane.org,
drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org,
vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org,
yehuda-L5o5AL9CYN0tUFlbccrkMA@public.gmane.org
Subject: Re: [PATCH] A possible deadlock with stacked devices (was: [PATCH v4 08/12] block: Introduce new bio_split())
Date: Wed, 29 Aug 2012 12:08:51 -0400 (EDT)
Message-ID: <Pine.LNX.4.64.1208291206110.774@file.rdu.redhat.com>
In-Reply-To: <20120815230715.GD2758-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
On Wed, 15 Aug 2012, Kent Overstreet wrote:
> > Both md and dm use __GFP_WAIT allocations from mempools in
> > generic_make_request.
> >
> > I think you found an interesting bug here. Suppose that we have three
> > stacked devices: d1 depends on d2 and d2 depends on d3.
> >
> > Now, a bio b1 comes to d1. d1 splits it into two bios, b2.1 and b2.2, and
> > sends them to the device d2 - these bios end up on current->bio_list. The
> > driver for d2 receives bio b2.1 and sends bio b3.1 to d3, so
> > current->bio_list now contains bios b2.2, b3.1. Next, bio b2.2 is popped
> > off the bio list and the driver for d2 is called with b2.2 - suppose that
> > for some reason the mempool in d2 is exhausted and the driver needs to
> > wait until b2.1 finishes. b2.1 never finishes, because b2.1 depends on
> > b3.1 and b3.1 is still on current->bio_list. So it deadlocks.
> >
> > Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible
> > deadlock) into another bug (a possible bio failure with -ENOMEM).
> >
> > Increasing mempool sizes doesn't fix it either; the bio would just have
> > to be split into more pieces in the above example to make it deadlock.
> >
> > I think the above possible deadlock could be fixed by reversing
> > current->bio_list processing - i.e., when some device's make_request_fn
> > adds bios to current->bio_list, those bios are processed before the bios
> > that were already on the list. This way, the bio list would contain
> > "b3.1, b2.2" instead of "b2.2, b3.1" in the above example and the
> > deadlock would not happen.
>
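[ Illustration: the FIFO behavior described above comes from the way
generic_make_request() drains current->bio_list. A simplified sketch of
that loop - the real function also runs generic_make_request_checks()
and a few sanity checks omitted here:

        void generic_make_request(struct bio *bio)
        {
                struct bio_list bio_list_on_stack;

                if (current->bio_list) {
                        /* We are already inside the loop below, somewhere
                         * up the stack: queue the bio instead of recursing.
                         * bio_list_add() appends to the tail, which is what
                         * makes the processing order FIFO. */
                        bio_list_add(current->bio_list, bio);
                        return;
                }

                bio_list_init(&bio_list_on_stack);
                current->bio_list = &bio_list_on_stack;
                do {
                        struct request_queue *q = bdev_get_queue(bio->bi_bdev);

                        q->make_request_fn(q, bio);

                        bio = bio_list_pop(current->bio_list); /* head first */
                } while (bio);
                current->bio_list = NULL;
        }

The proposed reversal would make the bios generated by a make_request_fn
come off the list before the bios queued ahead of them - for example by
collecting them on a temporary list and splicing them to the head with
bio_list_merge_head() instead of appending each one with bio_list_add(). ]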
> Your patch isn't sufficient in the case where a bio may be split
> multiple times (I'm not sure if it's sufficient in the case where bios
> are only split once; trying to work it all out makes my head hurt).
>
> You don't need multiple stacked drivers to see this; just the case where
> a single driver is running that splits multiple times is sufficient, if
> you have enough threads submitting at the same time.
That is true. dm splits one bio into multiple bios, so it could still
deadlock.
Mikulas
> Bcache works around this with the trick I mentioned previously, where it
> masks out __GFP_WAIT if current->bio_list != NULL, and punts to a
> workqueue if the allocation fails.
>
> This works but it'd have to be done in each stacking driver... it's not
> a generic solution, and it's a pain in the ass.
>
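[ For context, the per-driver workaround described above looks roughly
like the following sketch. The names d->bio_set, d->wq and
d->deferred_work are hypothetical, made up for illustration - this is
not bcache's actual code:

        static struct bio *driver_alloc_bio(struct driver *d, gfp_t gfp)
        {
                struct bio *bio;

                /* Don't sleep in the allocator while bios we already
                 * queued are parked on current->bio_list: they cannot
                 * complete (and refill the mempool) until we return. */
                if (current->bio_list)
                        gfp &= ~__GFP_WAIT;

                bio = bio_alloc_bioset(gfp, 0, d->bio_set);
                if (!bio)
                        /* The atomic allocation failed: punt the whole
                         * request to a workqueue, where current->bio_list
                         * is NULL and blocking on the mempool is safe. */
                        queue_work(d->wq, &d->deferred_work);

                return bio;
        }
]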
> I came up with another idea the other day. Conceptually, it inverts my
> previous workaround - the punting to workqueue is done in the allocation
> code when necessary, for the bios that would be blocked.
>
> It's lightly tested; I'm going to rig up some kind of fault injection
> and test it more thoroughly later.
>
> commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8
> Author: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Date: Mon Aug 13 18:11:01 2012 -0700
>
> block: Avoid deadlocks with bio allocation by stacking drivers
>
> Previously, if we ever tried to allocate more than once from the same
> bio set while running under generic_make_request(), we risked deadlock.
>
> This would happen if e.g. a bio ever needed to be split more than once,
> and it's difficult to handle correctly in the drivers - so in practice
> it's not handled.
>
> This patch fixes this issue by allocating a rescuer workqueue for each
> bio_set, and punting queued bios to said rescuer when necessary:
>
> diff --git a/fs/bio.c b/fs/bio.c
> index bc4265a..7b4f655 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -281,6 +281,23 @@ void bio_reset(struct bio *bio)
>  }
>  EXPORT_SYMBOL(bio_reset);
>  
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> +        struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> +        struct bio *bio;
> +
> +        while (1) {
> +                spin_lock(&bs->rescue_lock);
> +                bio = bio_list_pop(&bs->rescue_list);
> +                spin_unlock(&bs->rescue_lock);
> +
> +                if (!bio)
> +                        break;
> +
> +                generic_make_request(bio);
> +        }
> +}
> +
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset);
>   **/
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> +        gfp_t saved_gfp = gfp_mask;
>          unsigned front_pad;
>          unsigned inline_vecs;
>          unsigned long idx = BIO_POOL_NONE;
> @@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  p = kmalloc(sizeof(struct bio) +
>                              nr_iovecs * sizeof(struct bio_vec),
>                              gfp_mask);
> +
>                  front_pad = 0;
>                  inline_vecs = nr_iovecs;
>          } else {
> -                p = mempool_alloc(bs->bio_pool, gfp_mask);
> +                /*
> +                 * If we're running under generic_make_request()
> +                 * (current->bio_list != NULL), we risk deadlock if we sleep on
> +                 * allocation and there's already bios on current->bio_list that
> +                 * were allocated from the same bio_set; they won't be submitted
> +                 * (and thus freed) as long as we're blocked here.
> +                 *
> +                 * To deal with this, we first try the allocation without using
> +                 * the mempool; if that fails, we punt all the bios on
> +                 * current->bio_list to a different thread and then retry the
> +                 * allocation with the original gfp mask.
> +                 */
> +
> +                if (current->bio_list &&
> +                    !bio_list_empty(current->bio_list) &&
> +                    (gfp_mask & __GFP_WAIT))
> +                        gfp_mask &= GFP_ATOMIC;
> +retry:
> +                if (gfp_mask & __GFP_WAIT)
> +                        p = mempool_alloc(bs->bio_pool, gfp_mask);
> +                else
> +                        p = kmem_cache_alloc(bs->bio_slab, gfp_mask);
> +
>                  front_pad = bs->front_pad;
>                  inline_vecs = BIO_INLINE_VECS;
>          }
>  
>          if (unlikely(!p))
> -                return NULL;
> +                goto err;
>  
>          bio = p + front_pad;
>          bio_init(bio);
> @@ -338,6 +379,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  
>  err_free:
>          mempool_free(p, bs->bio_pool);
> +err:
> +        if (gfp_mask != saved_gfp) {
> +                gfp_mask = saved_gfp;
> +
> +                spin_lock(&bs->rescue_lock);
> +                bio_list_merge(&bs->rescue_list, current->bio_list);
> +                bio_list_init(current->bio_list);
> +                spin_unlock(&bs->rescue_lock);
> +
> +                queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +                goto retry;
> +        }
> +
>          return NULL;
>  }
>  EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1546,6 +1600,9 @@ static void biovec_free_pools(struct bio_set *bs)
>  
>  void bioset_free(struct bio_set *bs)
>  {
> +        if (bs->rescue_workqueue)
> +                destroy_workqueue(bs->rescue_workqueue);
> +
>          if (bs->bio_pool)
>                  mempool_destroy(bs->bio_pool);
>  
> @@ -1581,6 +1638,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>  
>          bs->front_pad = front_pad;
>  
> +        spin_lock_init(&bs->rescue_lock);
> +        bio_list_init(&bs->rescue_list);
> +        INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
>          bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
>          if (!bs->bio_slab) {
>                  kfree(bs);
> @@ -1591,9 +1652,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>          if (!bs->bio_pool)
>                  goto bad;
>  
> -        if (!biovec_create_pools(bs, pool_size))
> -                return bs;
> +        if (biovec_create_pools(bs, pool_size))
> +                goto bad;
> +
> +        bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> +        if (!bs->rescue_workqueue)
> +                goto bad;
>  
> +        return bs;
> bad:
>          bioset_free(bs);
>          return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b22c22b..ba5b52e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
>  static inline void bio_disassociate_task(struct bio *bio) { }
>  #endif  /* CONFIG_BLK_CGROUP */
>  
> -/*
> - * bio_set is used to allow other portions of the IO system to
> - * allocate their own private memory pools for bio and iovec structures.
> - * These memory pools in turn all allocate from the bio_slab
> - * and the bvec_slabs[].
> - */
> -#define BIO_POOL_SIZE 2
> -#define BIOVEC_NR_POOLS 6
> -#define BIOVEC_MAX_IDX  (BIOVEC_NR_POOLS - 1)
> -
> -struct bio_set {
> -        struct kmem_cache *bio_slab;
> -        unsigned int front_pad;
> -
> -        mempool_t *bio_pool;
> -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> -        mempool_t *bio_integrity_pool;
> -#endif
> -        mempool_t *bvec_pool;
> -};
> -
> -struct biovec_slab {
> -        int nr_vecs;
> -        char *name;
> -        struct kmem_cache *slab;
> -};
> -
> -/*
> - * a small number of entries is fine, not going to be performance critical.
> - * basically we just need to survive
> - */
> -#define BIO_SPLIT_ENTRIES 2
> -
>  #ifdef CONFIG_HIGHMEM
>  /*
>   * remember never ever reenable interrupts between a bvec_kmap_irq and
> @@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
>          return bio;
>  }
>  
> +/*
> + * bio_set is used to allow other portions of the IO system to
> + * allocate their own private memory pools for bio and iovec structures.
> + * These memory pools in turn all allocate from the bio_slab
> + * and the bvec_slabs[].
> + */
> +#define BIO_POOL_SIZE 2
> +#define BIOVEC_NR_POOLS 6
> +#define BIOVEC_MAX_IDX  (BIOVEC_NR_POOLS - 1)
> +
> +struct bio_set {
> +        struct kmem_cache *bio_slab;
> +        unsigned int front_pad;
> +
> +        mempool_t *bio_pool;
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> +        mempool_t *bio_integrity_pool;
> +#endif
> +        mempool_t *bvec_pool;
> +
> +        /*
> +         * Deadlock avoidance for stacking block drivers: see comments in
> +         * bio_alloc_bioset() for details
> +         */
> +        spinlock_t rescue_lock;
> +        struct bio_list rescue_list;
> +        struct work_struct rescue_work;
> +        struct workqueue_struct *rescue_workqueue;
> +};
> +
> +struct biovec_slab {
> +        int nr_vecs;
> +        char *name;
> +        struct kmem_cache *slab;
> +};
> +
> +/*
> + * a small number of entries is fine, not going to be performance critical.
> + * basically we just need to survive
> + */
> +#define BIO_SPLIT_ENTRIES 2
> +
>  #if defined(CONFIG_BLK_DEV_INTEGRITY)
>  
>  #define bip_vec_idx(bip, idx)   (&(bip->bip_vec[(idx)]))
> 
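[ To make the effect of the patch concrete: a stacking driver can now
block on its bio_set even while running under generic_make_request().
A hypothetical caller - struct my_dev and the remapping step are made
up for illustration; only bio_alloc_bioset() and generic_make_request()
are real:

        static void my_make_request(struct request_queue *q, struct bio *bio)
        {
                struct my_dev *dev = q->queuedata;
                struct bio *clone;

                /*
                 * May sleep on dev->bio_set's mempool. If the non-blocking
                 * first attempt fails, bio_alloc_bioset() punts the bios
                 * parked on current->bio_list to the bio_set's rescue
                 * workqueue before retrying with __GFP_WAIT, so nothing we
                 * submitted earlier can stay pinned behind us while we wait.
                 */
                clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs,
                                         dev->bio_set);

                /* ... set up *clone from *bio here ... */

                generic_make_request(clone);
        }

The design point is that the punting happens once, inside the allocator,
instead of being reimplemented in every stacking driver. ]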