* [PATCH v2] block, dm: don't copy bios for request clones
@ 2015-04-25 10:23 Christoph Hellwig
2015-04-25 18:13 ` Hannes Reinecke
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
0 siblings, 2 replies; 12+ messages in thread
From: Christoph Hellwig @ 2015-04-25 10:23 UTC (permalink / raw)
To: Mike Snitzer, Jens Axboe; +Cc: dm-devel
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-core.c | 94 +++----------------------
drivers/md/dm-table.c | 28 +++++---
drivers/md/dm.c | 173 +++++++++++-----------------------------------
drivers/md/dm.h | 6 +-
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 6 +-
6 files changed, 78 insertions(+), 231 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index fd154b9..4ef93a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(blk_rq_init);
static void req_bio_endio(struct request *rq, struct bio *bio,
unsigned int nbytes, int error)
{
- if (error)
+ if (error && !(rq->cmd_flags & REQ_CLONE))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
error = -EIO;
@@ -128,7 +128,8 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
bio_advance(bio, nbytes);
/* don't actually finish bio if it's part of flush sequence */
- if (bio->bi_iter.bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+ if (bio->bi_iter.bi_size == 0 &&
+ !(rq->cmd_flags & (REQ_FLUSH_SEQ|REQ_CLONE)))
bio_endio(bio, error);
}
@@ -2901,95 +2902,22 @@ int blk_lld_busy(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_lld_busy);
-/**
- * blk_rq_unprep_clone - Helper function to free all bios in a cloned request
- * @rq: the clone request to be cleaned up
- *
- * Description:
- * Free all bios in @rq for a cloned request.
- */
-void blk_rq_unprep_clone(struct request *rq)
-{
- struct bio *bio;
-
- while ((bio = rq->bio) != NULL) {
- rq->bio = bio->bi_next;
-
- bio_put(bio);
- }
-}
-EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
-
-/*
- * Copy attributes of the original request to the clone request.
- * The actual data parts (e.g. ->cmd, ->sense) are not copied.
- */
-static void __blk_rq_prep_clone(struct request *dst, struct request *src)
+void blk_rq_prep_clone(struct request *dst, struct request *src)
{
dst->cpu = src->cpu;
- dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+ dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK);
+ dst->cmd_flags |= REQ_NOMERGE | REQ_CLONE;
dst->cmd_type = src->cmd_type;
dst->__sector = blk_rq_pos(src);
dst->__data_len = blk_rq_bytes(src);
dst->nr_phys_segments = src->nr_phys_segments;
dst->ioprio = src->ioprio;
dst->extra_len = src->extra_len;
-}
-
-/**
- * blk_rq_prep_clone - Helper function to setup clone request
- * @rq: the request to be setup
- * @rq_src: original request to be cloned
- * @bs: bio_set that bios for clone are allocated from
- * @gfp_mask: memory allocation mask for bio
- * @bio_ctr: setup function to be called for each clone bio.
- * Returns %0 for success, non %0 for failure.
- * @data: private data to be passed to @bio_ctr
- *
- * Description:
- * Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
- * The actual data parts of @rq_src (e.g. ->cmd, ->sense)
- * are not copied, and copying such parts is the caller's responsibility.
- * Also, pages which the original bios are pointing to are not copied
- * and the cloned bios just point same pages.
- * So cloned bios must be completed before original bios, which means
- * the caller must complete @rq before @rq_src.
- */
-int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
- struct bio_set *bs, gfp_t gfp_mask,
- int (*bio_ctr)(struct bio *, struct bio *, void *),
- void *data)
-{
- struct bio *bio, *bio_src;
-
- if (!bs)
- bs = fs_bio_set;
-
- __rq_for_each_bio(bio_src, rq_src) {
- bio = bio_clone_fast(bio_src, gfp_mask, bs);
- if (!bio)
- goto free_and_out;
-
- if (bio_ctr && bio_ctr(bio, bio_src, data))
- goto free_and_out;
-
- if (rq->bio) {
- rq->biotail->bi_next = bio;
- rq->biotail = bio;
- } else
- rq->bio = rq->biotail = bio;
- }
-
- __blk_rq_prep_clone(rq, rq_src);
-
- return 0;
-
-free_and_out:
- if (bio)
- bio_put(bio);
- blk_rq_unprep_clone(rq);
-
- return -ENOMEM;
+ dst->bio = src->bio;
+ dst->biotail = src->biotail;
+ dst->cmd = src->cmd;
+ dst->cmd_len = src->cmd_len;
+ dst->sense = src->sense;
}
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d9b00b8..1f859cd 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -938,23 +938,31 @@ bool dm_table_mq_request_based(struct dm_table *t)
static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *md)
{
- unsigned type = dm_table_get_type(t);
+ int type = dm_table_get_type(t);
unsigned per_bio_data_size = 0;
- struct dm_target *tgt;
unsigned i;
- if (unlikely(type == DM_TYPE_NONE)) {
+ switch (type) {
+ case DM_TYPE_BIO_BASED:
+ for (i = 0; i < t->num_targets; i++) {
+ struct dm_target *tgt = t->targets + i;
+
+ per_bio_data_size = max(per_bio_data_size,
+ tgt->per_bio_data_size);
+ }
+
+ t->mempools = dm_alloc_bio_mempools(t->integrity_supported,
+ per_bio_data_size);
+ break;
+ case DM_TYPE_REQUEST_BASED:
+ case DM_TYPE_MQ_REQUEST_BASED:
+ t->mempools = dm_alloc_rq_mempools(md, type);
+ break;
+ default:
DMWARN("no table type is set, can't allocate mempools");
return -EINVAL;
}
- if (type == DM_TYPE_BIO_BASED)
- for (i = 0; i < t->num_targets; i++) {
- tgt = t->targets + i;
- per_bio_data_size = max(per_bio_data_size, tgt->per_bio_data_size);
- }
-
- t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_bio_data_size);
if (!t->mempools)
return -ENOMEM;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f8c7ca3..6ea18dc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -990,57 +990,6 @@ static void clone_endio(struct bio *bio, int error)
dec_pending(io, error);
}
-/*
- * Partial completion handling for request-based dm
- */
-static void end_clone_bio(struct bio *clone, int error)
-{
- struct dm_rq_clone_bio_info *info =
- container_of(clone, struct dm_rq_clone_bio_info, clone);
- struct dm_rq_target_io *tio = info->tio;
- struct bio *bio = info->orig;
- unsigned int nr_bytes = info->orig->bi_iter.bi_size;
-
- bio_put(clone);
-
- if (tio->error)
- /*
- * An error has already been detected on the request.
- * Once error occurred, just let clone->end_io() handle
- * the remainder.
- */
- return;
- else if (error) {
- /*
- * Don't notice the error to the upper layer yet.
- * The error handling decision is made by the target driver,
- * when the request is completed.
- */
- tio->error = error;
- return;
- }
-
- /*
- * I/O for the bio successfully completed.
- * Notice the data completion to the upper layer.
- */
-
- /*
- * bios are processed from the head of the list.
- * So the completing bio should always be rq->bio.
- * If it's not, something wrong is happening.
- */
- if (tio->orig->bio != bio)
- DMERR("bio completion is going in the middle of the request");
-
- /*
- * Update the original request.
- * Do not use blk_end_request() here, because it may complete
- * the original request before the clone, and break the ordering.
- */
- blk_update_request(tio->orig, 0, nr_bytes);
-}
-
static struct dm_rq_target_io *tio_from_request(struct request *rq)
{
return (rq->q->mq_ops ? blk_mq_rq_to_pdu(rq) : rq->special);
@@ -1087,8 +1036,6 @@ static void free_rq_clone(struct request *clone)
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
- blk_rq_unprep_clone(clone);
-
if (clone->q->mq_ops)
tio->ti->type->release_clone_rq(clone);
else if (!md->queue->mq_ops)
@@ -1813,39 +1760,13 @@ static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
dm_complete_request(rq, r);
}
-static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
- void *data)
-{
- struct dm_rq_target_io *tio = data;
- struct dm_rq_clone_bio_info *info =
- container_of(bio, struct dm_rq_clone_bio_info, clone);
-
- info->orig = bio_orig;
- info->tio = tio;
- bio->bi_end_io = end_clone_bio;
-
- return 0;
-}
-
-static int setup_clone(struct request *clone, struct request *rq,
- struct dm_rq_target_io *tio, gfp_t gfp_mask)
+static void setup_clone(struct request *clone, struct request *rq,
+ struct dm_rq_target_io *tio)
{
- int r;
-
- r = blk_rq_prep_clone(clone, rq, tio->md->bs, gfp_mask,
- dm_rq_bio_constructor, tio);
- if (r)
- return r;
-
- clone->cmd = rq->cmd;
- clone->cmd_len = rq->cmd_len;
- clone->sense = rq->sense;
+ blk_rq_prep_clone(clone, rq);
clone->end_io = end_clone_request;
clone->end_io_data = tio;
-
tio->clone = clone;
-
- return 0;
}
static struct request *clone_rq(struct request *rq, struct mapped_device *md,
@@ -1866,13 +1787,7 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
clone = tio->clone;
blk_rq_init(NULL, clone);
- if (setup_clone(clone, rq, tio, gfp_mask)) {
- /* -ENOMEM */
- if (alloc_clone)
- free_clone_request(md, clone);
- return NULL;
- }
-
+ setup_clone(clone, rq, tio);
return clone;
}
@@ -1965,11 +1880,7 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
}
if (IS_ERR(clone))
return DM_MAPIO_REQUEUE;
- if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
- /* -ENOMEM */
- ti->type->release_clone_rq(clone);
- return DM_MAPIO_REQUEUE;
- }
+ setup_clone(clone, rq, tio);
}
switch (r) {
@@ -2423,8 +2334,6 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
goto out;
}
- BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
-
md->io_pool = p->io_pool;
p->io_pool = NULL;
md->rq_pool = p->rq_pool;
@@ -3531,48 +3440,24 @@ int dm_noflush_suspending(struct dm_target *ti)
}
EXPORT_SYMBOL_GPL(dm_noflush_suspending);
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
- unsigned integrity, unsigned per_bio_data_size)
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+ unsigned per_bio_data_size)
{
- struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL);
- struct kmem_cache *cachep = NULL;
- unsigned int pool_size = 0;
+ struct dm_md_mempools *pools;
+ unsigned int pool_size = dm_get_reserved_bio_based_ios();
unsigned int front_pad;
+ pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
return NULL;
- type = filter_md_type(type, md);
+ front_pad = roundup(per_bio_data_size,
+ __alignof__(struct dm_target_io)) +
+ offsetof(struct dm_target_io, clone);
- switch (type) {
- case DM_TYPE_BIO_BASED:
- cachep = _io_cache;
- pool_size = dm_get_reserved_bio_based_ios();
- front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
- break;
- case DM_TYPE_REQUEST_BASED:
- cachep = _rq_tio_cache;
- pool_size = dm_get_reserved_rq_based_ios();
- pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
- if (!pools->rq_pool)
- goto out;
- /* fall through to setup remaining rq-based pools */
- case DM_TYPE_MQ_REQUEST_BASED:
- if (!pool_size)
- pool_size = dm_get_reserved_rq_based_ios();
- front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
- /* per_bio_data_size is not used. See __bind_mempools(). */
- WARN_ON(per_bio_data_size != 0);
- break;
- default:
- BUG();
- }
-
- if (cachep) {
- pools->io_pool = mempool_create_slab_pool(pool_size, cachep);
- if (!pools->io_pool)
- goto out;
- }
+ pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
+ if (!pools->io_pool)
+ goto out;
pools->bs = bioset_create_nobvec(pool_size, front_pad);
if (!pools->bs)
@@ -3585,7 +3470,33 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
out:
dm_free_md_mempools(pools);
+ return NULL;
+}
+
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
+ unsigned type)
+{
+ unsigned int pool_size = dm_get_reserved_rq_based_ios();
+ struct dm_md_mempools *pools;
+ pools = kzalloc(sizeof(*pools), GFP_KERNEL);
+ if (!pools)
+ return NULL;
+
+ if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
+ pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+ if (!pools->rq_pool)
+ goto out;
+ }
+
+ pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
+ if (!pools->io_pool)
+ goto out;
+
+ return pools;
+
+out:
+ dm_free_md_mempools(pools);
return NULL;
}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6123c2b..0222d56 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -222,8 +222,10 @@ void dm_kcopyd_exit(void);
/*
* Mempool operations
*/
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
- unsigned integrity, unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+ unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
+ unsigned type);
void dm_free_md_mempools(struct dm_md_mempools *pools);
/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e3..3b0f8f4 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -193,6 +193,7 @@ enum rq_flag_bits {
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
__REQ_NO_TIMEOUT, /* requests may never expire */
+ __REQ_CLONE, /* cloned bios */
__REQ_NR_BITS, /* stops here */
};
@@ -247,5 +248,6 @@ enum rq_flag_bits {
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
#define REQ_NO_TIMEOUT (1ULL << __REQ_NO_TIMEOUT)
+#define REQ_CLONE (1ULL << __REQ_CLONE)
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..9cb8174 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -804,11 +804,7 @@ extern void blk_add_request_payload(struct request *rq, struct page *page,
unsigned int len);
extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
extern int blk_lld_busy(struct request_queue *q);
-extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
- struct bio_set *bs, gfp_t gfp_mask,
- int (*bio_ctr)(struct bio *, struct bio *, void *),
- void *data);
-extern void blk_rq_unprep_clone(struct request *rq);
+extern void blk_rq_prep_clone(struct request *rq, struct request *rq_src);
extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern void blk_delay_queue(struct request_queue *, unsigned long);
--
1.9.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2] block, dm: don't copy bios for request clones
2015-04-25 10:23 [PATCH v2] block, dm: don't copy bios for request clones Christoph Hellwig
@ 2015-04-25 18:13 ` Hannes Reinecke
2015-04-26 13:35 ` Mike Snitzer
2015-04-26 14:20 ` Christoph Hellwig
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
1 sibling, 2 replies; 12+ messages in thread
From: Hannes Reinecke @ 2015-04-25 18:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel, Mike Snitzer
On 04/25/2015 12:23 PM, Christoph Hellwig wrote:
> Currently dm-multipath has to clone the bios for every request sent
> to the lower devices, which wastes cpu cycles and ties down memory.
>
> This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
> to not complete bios attached to a request, which we set on clone
> requests similar to bios in a flush sequence. With this change I/O
> errors on a path failure only get propagated to dm-multipath, which
> can then either resubmit the I/O or complete the bios on the original
> request.
>
Hehe.
I seem to remember having sent a similar patch about a year ago;
which then got shot down due to the missing partial completion
handling.
> I've done some basic testing of this on a Linux target with ALUA support,
> and it survives path failures during I/O nicely.
>
So did I ...
Anyway; we've discussed this at LSF in Boston, haven't we?
AFAICR we've found that having to resubmit the entire command
in the case of partial completion is okay with the storage
vendors, so this patch is a viable way of handling things.
_But_ I really would like to have a consensus here that this
_is_ the correct way of handling partial request; because
if that is the case then we should adopt this strategy
throughout the SCSI layer (ie in scsi_io_completion())
and document the fact.
I really don't like to have two different completion paths;
we should decide on one way and then use it throughout
the stack.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] block, dm: don't copy bios for request clones
2015-04-25 18:13 ` Hannes Reinecke
@ 2015-04-26 13:35 ` Mike Snitzer
2015-04-26 14:20 ` Christoph Hellwig
1 sibling, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2015-04-26 13:35 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Christoph Hellwig, Jens Axboe, dm-devel
On Sat, Apr 25 2015 at 2:13pm -0400,
Hannes Reinecke <hare@suse.de> wrote:
> On 04/25/2015 12:23 PM, Christoph Hellwig wrote:
> > Currently dm-multipath has to clone the bios for every request sent
> > to the lower devices, which wastes cpu cycles and ties down memory.
> >
> > This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
> > to not complete bios attached to a request, which we set on clone
> > requests similar to bios in a flush sequence. With this change I/O
> > errors on a path failure only get propagated to dm-multipath, which
> > can then either resubmit the I/O or complete the bios on the original
> > request.
> >
> Hehe.
>
> I seem to remember having sent a similar patch about a year ago;
> which then got shot down due to the missing partial completion
> handling.
But your approch was entirely different and _not_ acceptable considering
it completely eliminated request cloning. In the context of blk-mq we
need the request to be allocated directly from the blk-mq device --
"cloning" allows enough indirection to make that workable (as has
already landed upstream).
And based on discussion that hch, Jens and I had at LSF eliminating the
cloning of a request's bios was very much a near term goal. I even
forecast as much in this commit 022333427 ("dm: optimize dm_mq_queue_rq
to _not_ use kthread if using pure blk-mq"):
"In the future the bioset allocations will hopefully go away (by
removing support for partial completions of bios in a cloned request)."
> > I've done some basic testing of this on a Linux target with ALUA support,
> > and it survives path failures during I/O nicely.
> >
> So did I ...
>
> Anyway; we've discussed this at LSF in Boston, haven't we?
> AFAICR we've found that having to resubmit the entire command
> in the case of partial completion is okay with the storage
> vendors, so this patch is a viable way of handling things.
>
> _But_ I really would like to have a consensus here that this
> _is_ the correct way of handling partial request; because
> if that is the case then we should adopt this strategy
> throughout the SCSI layer (ie in scsi_io_completion())
> and document the fact.
>
> I really don't like to have two different completion paths;
> we should decide on one way and then use it throughout
> the stack.
Can you elaborate on why DM should be constrained by what SCSI does?
For DM partial completion was more about trapping failures more quickly
(not concerns about command resubmission, etc).
AFAICT there is no reason to impose that both eliminate partial
completion at the same time. What am I missing?
But if you have consensus from storage vendors that eliminating partial
completion is OK then why not just make it happen? If you do so for 4.2
then it'll look like we coordinated between DM and SCSI ;)
Mike
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] block, dm: don't copy bios for request clones
2015-04-25 18:13 ` Hannes Reinecke
2015-04-26 13:35 ` Mike Snitzer
@ 2015-04-26 14:20 ` Christoph Hellwig
1 sibling, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2015-04-26 14:20 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Christoph Hellwig, Jens Axboe, dm-devel, Mike Snitzer
On Sat, Apr 25, 2015 at 08:13:28PM +0200, Hannes Reinecke wrote:
> I seem to remember having sent a similar patch about a year ago;
This one?
http://www.redhat.com/archives/dm-devel/2014-June/msg00023.html
> which then got shot down due to the missing partial completion
> handling.
It got shot down because it passed the upper level request down,
which is a non-no for blk-mq.
> Anyway; we've discussed this at LSF in Boston, haven't we?
> AFAICR we've found that having to resubmit the entire command
> in the case of partial completion is okay with the storage
> vendors, so this patch is a viable way of handling things.
Yes. But even if it wasn't for some reason we could inspect blk_rq_bytes()
on the clone and do a partial completion on the original request, but
I'd prefer to avoid special casing this case if we can avoid it.
> _But_ I really would like to have a consensus here that this
> _is_ the correct way of handling partial request; because
> if that is the case then we should adopt this strategy
> throughout the SCSI layer (ie in scsi_io_completion())
> and document the fact.
SCSI is a very different case, we do regularly get patial completions
from arrays, and we need to handle them efficiently in the normal
I/O path. The only case where we will resubmit completed I/O in
dm-mpath is when we got a successfull partial completion in SCSI,
and then a following completions returns an error that causes a
failover. I don't think there is a need to micro-optimize this case.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3] block, dm: don't copy bios for request clones
2015-04-25 10:23 [PATCH v2] block, dm: don't copy bios for request clones Christoph Hellwig
2015-04-25 18:13 ` Hannes Reinecke
@ 2015-04-28 0:59 ` Mike Snitzer
2015-04-28 1:03 ` [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
` (3 more replies)
1 sibling, 4 replies; 12+ messages in thread
From: Mike Snitzer @ 2015-04-28 0:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel
From: Christoph Hellwig <hch@lst.de>
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.
This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.
I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
block/blk-core.c | 94 +++----------------------
drivers/md/dm-table.c | 25 ++++---
drivers/md/dm.c | 171 +++++++++++-----------------------------------
drivers/md/dm.h | 5 +-
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 6 +-
6 files changed, 73 insertions(+), 230 deletions(-)
v3: really just some line-wrapping, whitespace, and misc small nits
diff --git a/block/blk-core.c b/block/blk-core.c
index fd154b9..4ef93a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(blk_rq_init);
static void req_bio_endio(struct request *rq, struct bio *bio,
unsigned int nbytes, int error)
{
- if (error)
+ if (error && !(rq->cmd_flags & REQ_CLONE))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
error = -EIO;
@@ -128,7 +128,8 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
bio_advance(bio, nbytes);
/* don't actually finish bio if it's part of flush sequence */
- if (bio->bi_iter.bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+ if (bio->bi_iter.bi_size == 0 &&
+ !(rq->cmd_flags & (REQ_FLUSH_SEQ|REQ_CLONE)))
bio_endio(bio, error);
}
@@ -2901,95 +2902,22 @@ int blk_lld_busy(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_lld_busy);
-/**
- * blk_rq_unprep_clone - Helper function to free all bios in a cloned request
- * @rq: the clone request to be cleaned up
- *
- * Description:
- * Free all bios in @rq for a cloned request.
- */
-void blk_rq_unprep_clone(struct request *rq)
-{
- struct bio *bio;
-
- while ((bio = rq->bio) != NULL) {
- rq->bio = bio->bi_next;
-
- bio_put(bio);
- }
-}
-EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
-
-/*
- * Copy attributes of the original request to the clone request.
- * The actual data parts (e.g. ->cmd, ->sense) are not copied.
- */
-static void __blk_rq_prep_clone(struct request *dst, struct request *src)
+void blk_rq_prep_clone(struct request *dst, struct request *src)
{
dst->cpu = src->cpu;
- dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+ dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK);
+ dst->cmd_flags |= REQ_NOMERGE | REQ_CLONE;
dst->cmd_type = src->cmd_type;
dst->__sector = blk_rq_pos(src);
dst->__data_len = blk_rq_bytes(src);
dst->nr_phys_segments = src->nr_phys_segments;
dst->ioprio = src->ioprio;
dst->extra_len = src->extra_len;
-}
-
-/**
- * blk_rq_prep_clone - Helper function to setup clone request
- * @rq: the request to be setup
- * @rq_src: original request to be cloned
- * @bs: bio_set that bios for clone are allocated from
- * @gfp_mask: memory allocation mask for bio
- * @bio_ctr: setup function to be called for each clone bio.
- * Returns %0 for success, non %0 for failure.
- * @data: private data to be passed to @bio_ctr
- *
- * Description:
- * Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
- * The actual data parts of @rq_src (e.g. ->cmd, ->sense)
- * are not copied, and copying such parts is the caller's responsibility.
- * Also, pages which the original bios are pointing to are not copied
- * and the cloned bios just point same pages.
- * So cloned bios must be completed before original bios, which means
- * the caller must complete @rq before @rq_src.
- */
-int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
- struct bio_set *bs, gfp_t gfp_mask,
- int (*bio_ctr)(struct bio *, struct bio *, void *),
- void *data)
-{
- struct bio *bio, *bio_src;
-
- if (!bs)
- bs = fs_bio_set;
-
- __rq_for_each_bio(bio_src, rq_src) {
- bio = bio_clone_fast(bio_src, gfp_mask, bs);
- if (!bio)
- goto free_and_out;
-
- if (bio_ctr && bio_ctr(bio, bio_src, data))
- goto free_and_out;
-
- if (rq->bio) {
- rq->biotail->bi_next = bio;
- rq->biotail = bio;
- } else
- rq->bio = rq->biotail = bio;
- }
-
- __blk_rq_prep_clone(rq, rq_src);
-
- return 0;
-
-free_and_out:
- if (bio)
- bio_put(bio);
- blk_rq_unprep_clone(rq);
-
- return -ENOMEM;
+ dst->bio = src->bio;
+ dst->biotail = src->biotail;
+ dst->cmd = src->cmd;
+ dst->cmd_len = src->cmd_len;
+ dst->sense = src->sense;
}
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d9b00b8..3662b2e 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -940,21 +940,28 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
{
unsigned type = dm_table_get_type(t);
unsigned per_bio_data_size = 0;
- struct dm_target *tgt;
unsigned i;
- if (unlikely(type == DM_TYPE_NONE)) {
+ switch (type) {
+ case DM_TYPE_BIO_BASED:
+ for (i = 0; i < t->num_targets; i++) {
+ struct dm_target *tgt = t->targets + i;
+
+ per_bio_data_size = max(per_bio_data_size,
+ tgt->per_bio_data_size);
+ }
+ t->mempools = dm_alloc_bio_mempools(t->integrity_supported,
+ per_bio_data_size);
+ break;
+ case DM_TYPE_REQUEST_BASED:
+ case DM_TYPE_MQ_REQUEST_BASED:
+ t->mempools = dm_alloc_rq_mempools(md, type);
+ break;
+ default:
DMWARN("no table type is set, can't allocate mempools");
return -EINVAL;
}
- if (type == DM_TYPE_BIO_BASED)
- for (i = 0; i < t->num_targets; i++) {
- tgt = t->targets + i;
- per_bio_data_size = max(per_bio_data_size, tgt->per_bio_data_size);
- }
-
- t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_bio_data_size);
if (!t->mempools)
return -ENOMEM;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f8c7ca3..77ed371 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -990,57 +990,6 @@ static void clone_endio(struct bio *bio, int error)
dec_pending(io, error);
}
-/*
- * Partial completion handling for request-based dm
- */
-static void end_clone_bio(struct bio *clone, int error)
-{
- struct dm_rq_clone_bio_info *info =
- container_of(clone, struct dm_rq_clone_bio_info, clone);
- struct dm_rq_target_io *tio = info->tio;
- struct bio *bio = info->orig;
- unsigned int nr_bytes = info->orig->bi_iter.bi_size;
-
- bio_put(clone);
-
- if (tio->error)
- /*
- * An error has already been detected on the request.
- * Once error occurred, just let clone->end_io() handle
- * the remainder.
- */
- return;
- else if (error) {
- /*
- * Don't notice the error to the upper layer yet.
- * The error handling decision is made by the target driver,
- * when the request is completed.
- */
- tio->error = error;
- return;
- }
-
- /*
- * I/O for the bio successfully completed.
- * Notice the data completion to the upper layer.
- */
-
- /*
- * bios are processed from the head of the list.
- * So the completing bio should always be rq->bio.
- * If it's not, something wrong is happening.
- */
- if (tio->orig->bio != bio)
- DMERR("bio completion is going in the middle of the request");
-
- /*
- * Update the original request.
- * Do not use blk_end_request() here, because it may complete
- * the original request before the clone, and break the ordering.
- */
- blk_update_request(tio->orig, 0, nr_bytes);
-}
-
static struct dm_rq_target_io *tio_from_request(struct request *rq)
{
return (rq->q->mq_ops ? blk_mq_rq_to_pdu(rq) : rq->special);
@@ -1087,8 +1036,6 @@ static void free_rq_clone(struct request *clone)
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
- blk_rq_unprep_clone(clone);
-
if (clone->q->mq_ops)
tio->ti->type->release_clone_rq(clone);
else if (!md->queue->mq_ops)
@@ -1813,39 +1760,13 @@ static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
dm_complete_request(rq, r);
}
-static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
- void *data)
+static void setup_clone(struct request *clone, struct request *rq,
+ struct dm_rq_target_io *tio)
{
- struct dm_rq_target_io *tio = data;
- struct dm_rq_clone_bio_info *info =
- container_of(bio, struct dm_rq_clone_bio_info, clone);
-
- info->orig = bio_orig;
- info->tio = tio;
- bio->bi_end_io = end_clone_bio;
-
- return 0;
-}
-
-static int setup_clone(struct request *clone, struct request *rq,
- struct dm_rq_target_io *tio, gfp_t gfp_mask)
-{
- int r;
-
- r = blk_rq_prep_clone(clone, rq, tio->md->bs, gfp_mask,
- dm_rq_bio_constructor, tio);
- if (r)
- return r;
-
- clone->cmd = rq->cmd;
- clone->cmd_len = rq->cmd_len;
- clone->sense = rq->sense;
+ blk_rq_prep_clone(clone, rq);
clone->end_io = end_clone_request;
clone->end_io_data = tio;
-
tio->clone = clone;
-
- return 0;
}
static struct request *clone_rq(struct request *rq, struct mapped_device *md,
@@ -1866,12 +1787,7 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
clone = tio->clone;
blk_rq_init(NULL, clone);
- if (setup_clone(clone, rq, tio, gfp_mask)) {
- /* -ENOMEM */
- if (alloc_clone)
- free_clone_request(md, clone);
- return NULL;
- }
+ setup_clone(clone, rq, tio);
return clone;
}
@@ -1965,11 +1881,7 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
}
if (IS_ERR(clone))
return DM_MAPIO_REQUEUE;
- if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
- /* -ENOMEM */
- ti->type->release_clone_rq(clone);
- return DM_MAPIO_REQUEUE;
- }
+ setup_clone(clone, rq, tio);
}
switch (r) {
@@ -2423,8 +2335,6 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
goto out;
}
- BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
-
md->io_pool = p->io_pool;
p->io_pool = NULL;
md->rq_pool = p->rq_pool;
@@ -3531,48 +3441,23 @@ int dm_noflush_suspending(struct dm_target *ti)
}
EXPORT_SYMBOL_GPL(dm_noflush_suspending);
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
- unsigned integrity, unsigned per_bio_data_size)
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+ unsigned per_bio_data_size)
{
- struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL);
- struct kmem_cache *cachep = NULL;
- unsigned int pool_size = 0;
+ struct dm_md_mempools *pools;
+ unsigned int pool_size = dm_get_reserved_bio_based_ios();
unsigned int front_pad;
+ pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
return NULL;
- type = filter_md_type(type, md);
+ front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) +
+ offsetof(struct dm_target_io, clone);
- switch (type) {
- case DM_TYPE_BIO_BASED:
- cachep = _io_cache;
- pool_size = dm_get_reserved_bio_based_ios();
- front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
- break;
- case DM_TYPE_REQUEST_BASED:
- cachep = _rq_tio_cache;
- pool_size = dm_get_reserved_rq_based_ios();
- pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
- if (!pools->rq_pool)
- goto out;
- /* fall through to setup remaining rq-based pools */
- case DM_TYPE_MQ_REQUEST_BASED:
- if (!pool_size)
- pool_size = dm_get_reserved_rq_based_ios();
- front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
- /* per_bio_data_size is not used. See __bind_mempools(). */
- WARN_ON(per_bio_data_size != 0);
- break;
- default:
- BUG();
- }
-
- if (cachep) {
- pools->io_pool = mempool_create_slab_pool(pool_size, cachep);
- if (!pools->io_pool)
- goto out;
- }
+ pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
+ if (!pools->io_pool)
+ goto out;
pools->bs = bioset_create_nobvec(pool_size, front_pad);
if (!pools->bs)
@@ -3582,10 +3467,34 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
goto out;
return pools;
-
out:
dm_free_md_mempools(pools);
+ return NULL;
+}
+
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
+ unsigned type)
+{
+ unsigned int pool_size = dm_get_reserved_rq_based_ios();
+ struct dm_md_mempools *pools;
+ pools = kzalloc(sizeof(*pools), GFP_KERNEL);
+ if (!pools)
+ return NULL;
+
+ if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
+ pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+ if (!pools->rq_pool)
+ goto out;
+ }
+
+ pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
+ if (!pools->io_pool)
+ goto out;
+
+ return pools;
+out:
+ dm_free_md_mempools(pools);
return NULL;
}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6123c2b..e6e66d0 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -222,8 +222,9 @@ void dm_kcopyd_exit(void);
/*
* Mempool operations
*/
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
- unsigned integrity, unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+ unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md, unsigned type);
void dm_free_md_mempools(struct dm_md_mempools *pools);
/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e3..3b0f8f4 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -193,6 +193,7 @@ enum rq_flag_bits {
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
__REQ_NO_TIMEOUT, /* requests may never expire */
+ __REQ_CLONE, /* cloned bios */
__REQ_NR_BITS, /* stops here */
};
@@ -247,5 +248,6 @@ enum rq_flag_bits {
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
#define REQ_NO_TIMEOUT (1ULL << __REQ_NO_TIMEOUT)
+#define REQ_CLONE (1ULL << __REQ_CLONE)
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 727a8e2..02e20f3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -804,11 +804,7 @@ extern void blk_add_request_payload(struct request *rq, struct page *page,
unsigned int len);
extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
extern int blk_lld_busy(struct request_queue *q);
-extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
- struct bio_set *bs, gfp_t gfp_mask,
- int (*bio_ctr)(struct bio *, struct bio *, void *),
- void *data);
-extern void blk_rq_unprep_clone(struct request *rq);
+extern void blk_rq_prep_clone(struct request *rq, struct request *rq_src);
extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern void blk_delay_queue(struct request_queue *, unsigned long);
--
2.3.2 (Apple Git-55)
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
@ 2015-04-28 1:03 ` Mike Snitzer
2015-04-28 6:28 ` Christoph Hellwig
2015-04-28 5:49 ` [PATCH v3] block, dm: don't copy bios for request clones Hannes Reinecke
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Mike Snitzer @ 2015-04-28 1:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel
Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
Also refine __bind_mempools() to have more precise awareness of which
mempools each type of DM device uses -- avoids mempool churn when
reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-table.c | 4 +--
drivers/md/dm.c | 69 ++++++++++++++++++++++++++++-----------------------
2 files changed, 40 insertions(+), 33 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3662b2e..2000fea 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -962,8 +962,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
return -EINVAL;
}
- if (!t->mempools)
- return -ENOMEM;
+ if (IS_ERR(t->mempools))
+ return PTR_ERR(t->mempools);
return 0;
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 77ed371..1c0bc67 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2309,39 +2309,52 @@ static void free_dev(struct mapped_device *md)
kfree(md);
}
+static unsigned filter_md_type(unsigned type, struct mapped_device *md)
+{
+ if (type == DM_TYPE_BIO_BASED)
+ return type;
+
+ return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
+}
+
static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
{
struct dm_md_mempools *p = dm_table_get_md_mempools(t);
- if (md->bs) {
- /* The md already has necessary mempools. */
- if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) {
+ switch (filter_md_type(dm_table_get_type(t), md)) {
+ case DM_TYPE_BIO_BASED:
+ if (md->bs && md->io_pool) {
/*
+ * This bio-based md already has necessary mempools.
* Reload bioset because front_pad may have changed
* because a different table was loaded.
*/
bioset_free(md->bs);
md->bs = p->bs;
p->bs = NULL;
+ goto out;
}
- /*
- * There's no need to reload with request-based dm
- * because the size of front_pad doesn't change.
- * Note for future: If you are to reload bioset,
- * prep-ed requests in the queue may refer
- * to bio from the old bioset, so you must walk
- * through the queue to unprep.
- */
- goto out;
+ break;
+ case DM_TYPE_REQUEST_BASED:
+ if (md->rq_pool && md->io_pool)
+ /*
+ * This request-based md already has necessary mempools.
+ */
+ goto out;
+ break;
+ case DM_TYPE_MQ_REQUEST_BASED:
+ BUG_ON(p); /* No mempools needed */
+ return;
}
+ BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
+
md->io_pool = p->io_pool;
p->io_pool = NULL;
md->rq_pool = p->rq_pool;
p->rq_pool = NULL;
md->bs = p->bs;
p->bs = NULL;
-
out:
/* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t);
@@ -2721,14 +2734,6 @@ out_tag_set:
return err;
}
-static unsigned filter_md_type(unsigned type, struct mapped_device *md)
-{
- if (type == DM_TYPE_BIO_BASED)
- return type;
-
- return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
-}
-
/*
* Setup the DM device's queue based on md's type
*/
@@ -3450,7 +3455,7 @@ struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
- return NULL;
+ return ERR_PTR(-ENOMEM);
front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) +
offsetof(struct dm_target_io, clone);
@@ -3469,24 +3474,26 @@ struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
return pools;
out:
dm_free_md_mempools(pools);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
unsigned type)
{
- unsigned int pool_size = dm_get_reserved_rq_based_ios();
+ unsigned int pool_size;
struct dm_md_mempools *pools;
+ if (filter_md_type(type, md) == DM_TYPE_MQ_REQUEST_BASED)
+ return NULL; /* No mempools needed */
+
+ pool_size = dm_get_reserved_rq_based_ios();
pools = kzalloc(sizeof(*pools), GFP_KERNEL);
if (!pools)
- return NULL;
+ return ERR_PTR(-ENOMEM);
- if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
- pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
- if (!pools->rq_pool)
- goto out;
- }
+ pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+ if (!pools->rq_pool)
+ goto out;
pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
if (!pools->io_pool)
@@ -3495,7 +3502,7 @@ struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
return pools;
out:
dm_free_md_mempools(pools);
- return NULL;
+ return ERR_PTR(-ENOMEM);
}
void dm_free_md_mempools(struct dm_md_mempools *pools)
--
2.3.2 (Apple Git-55)
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v3] block, dm: don't copy bios for request clones
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
2015-04-28 1:03 ` [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
@ 2015-04-28 5:49 ` Hannes Reinecke
2015-04-28 6:29 ` Christoph Hellwig
2015-05-11 15:55 ` Mike Snitzer
3 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2015-04-28 5:49 UTC (permalink / raw)
To: device-mapper development, Christoph Hellwig; +Cc: Jens Axboe
On 04/28/2015 02:59 AM, Mike Snitzer wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Currently dm-multipath has to clone the bios for every request sent
> to the lower devices, which wastes cpu cycles and ties down memory.
>
> This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
> to not complete bios attached to a request, which we set on clone
> requests similar to bios in a flush sequence. With this change I/O
> errors on a path failure only get propagated to dm-multipath, which
> can then either resubmit the I/O or complete the bios on the original
> request.
>
> I've done some basic testing of this on a Linux target with ALUA support,
> and it survives path failures during I/O nicely.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM
2015-04-28 1:03 ` [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
@ 2015-04-28 6:28 ` Christoph Hellwig
2015-04-28 10:22 ` Mike Snitzer
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2015-04-28 6:28 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jens Axboe, dm-devel, Christoph Hellwig
On Mon, Apr 27, 2015 at 09:03:04PM -0400, Mike Snitzer wrote:
> Do not allocate the io_pool mempool for blk-mq request-based DM
> (DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
>
> Also refine __bind_mempools() to have more precise awareness of which
> mempools each type of DM device uses -- avoids mempool churn when
> reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
Btw, I looked into this code and didn't dare to touch it before I
understood how we deal with the case of dm-mpath using blk-mq and
low level driver not or the other way around.
As far as I can see we'd need the request mempool as long as the
low level driver does not use dm-mq, independent of what dm-mpath
itsel uses.
The code doesn't seem to handle this right, but as mentioned I'm not
100% sure and need to dive deeper into this. Is there a place enforcing
dm-mpath is using the same request type as the underlying devices?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v3] block, dm: don't copy bios for request clones
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
2015-04-28 1:03 ` [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
2015-04-28 5:49 ` [PATCH v3] block, dm: don't copy bios for request clones Hannes Reinecke
@ 2015-04-28 6:29 ` Christoph Hellwig
2015-04-28 9:45 ` Mike Snitzer
2015-05-11 15:55 ` Mike Snitzer
3 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2015-04-28 6:29 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jens Axboe, dm-devel
What is the change from v2? I've not really noticed anything.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v3] block, dm: don't copy bios for request clones
2015-04-28 6:29 ` Christoph Hellwig
@ 2015-04-28 9:45 ` Mike Snitzer
0 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2015-04-28 9:45 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel
On Tue, Apr 28 2015 at 2:29am -0400,
Christoph Hellwig <hch@lst.de> wrote:
> What is the change from v2? I've not really noticed anything.
Just whitespace really, I said:
"v3: really just some line-wrapping, whitespace, and misc small nits"
I could get you a diff but I just had some superficial changes.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM
2015-04-28 6:28 ` Christoph Hellwig
@ 2015-04-28 10:22 ` Mike Snitzer
0 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2015-04-28 10:22 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel
On Tue, Apr 28 2015 at 2:28am -0400,
Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Apr 27, 2015 at 09:03:04PM -0400, Mike Snitzer wrote:
> > Do not allocate the io_pool mempool for blk-mq request-based DM
> > (DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().
> >
> > Also refine __bind_mempools() to have more precise awareness of which
> > mempools each type of DM device uses -- avoids mempool churn when
> > reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).
>
> Btw, I looked into this code and didn't dare to touch it before I
> understood how we deal with the case of dm-mpath using blk-mq and
> low level driver not or the other way around.
Current code does deal with:
1) blk-mq queue on non-blk-mq underlying devices
2) blk-mq queue on blk-mq underlying devices
3) non-blk-mq queue on non-blk-mq underlying devices
4) non-blk-mq queue on blk-mq underlying devices
But all these permutations do definitely make the code less
approachable and maintainable.
> As far as I can see we'd need the request mempool as long as the
> low level driver does not use dm-mq, independent of what dm-mpath
> itsel uses.
>
> The code doesn't seem to handle this right, but as mentioned I'm not
> 100% sure and need to dive deeper into this. Is there a place enforcing
> dm-mpath is using the same request type as the underlying devices?
It is dm-table.c:dm_table_set_type() that determines whether a given DM
table load references underlying devices that are all blk-mq or not.
Based on what it finds it either establishes the type as
DM_TYPE_REQUEST_BASED or DM_TYPE_MQ_REQUEST_BASED.
DM_TYPE_REQUEST_BASED will make use of io_pool, whereas
DM_TYPE_MQ_REQUEST_BASED won't. There is an awkward duality where
use_blk_mq governs which mempools are created based on filter_md_type().
filter_md_type() is needed because the table's type is really only
concerned with the underlying devices' types -- not the top-level DM
queue. So things are awkward until we establish the top-level queue
(then we just make use of q->mq_ops like normal).
It is dm.c:dm_setup_md_queue() that establishes the top-level DM queue
based on the type.
How the DM queue's requests are cloned depends the underlying devices
(e.g. blk-mq vs not).
Top-level blk-mq DM device uses the md's type (DM_TYPE_REQUEST_BASED vs
DM_TYPE_MQ_REQUEST_BASED) to judge how to clone the request.
DM_TYPE_REQUEST_BASED implies legacy path.
There is still some awkward branching in the IO path but it ended up
being surprisingly manageable. Only concern I have is: in the long-run
will all this duality in the code compromise maintainability? But
hopefully we can kill the old request_fn path in block core (if/when
blk-mq IO scheduling lands) and the DM code can be cleaned up
accordingly.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v3] block, dm: don't copy bios for request clones
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
` (2 preceding siblings ...)
2015-04-28 6:29 ` Christoph Hellwig
@ 2015-05-11 15:55 ` Mike Snitzer
3 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2015-05-11 15:55 UTC (permalink / raw)
To: Jens Axboe; +Cc: dm-devel, Christoph Hellwig
On Mon, Apr 27 2015 at 8:59pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Currently dm-multipath has to clone the bios for every request sent
> to the lower devices, which wastes cpu cycles and ties down memory.
>
> This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
> to not complete bios attached to a request, which we set on clone
> requests similar to bios in a flush sequence. With this change I/O
> errors on a path failure only get propagated to dm-multipath, which
> can then either resubmit the I/O or complete the bios on the original
> request.
>
> I've done some basic testing of this on a Linux target with ALUA support,
> and it survives path failures during I/O nicely.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
> block/blk-core.c | 94 +++----------------------
> drivers/md/dm-table.c | 25 ++++---
> drivers/md/dm.c | 171 +++++++++++-----------------------------------
> drivers/md/dm.h | 5 +-
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 6 +-
> 6 files changed, 73 insertions(+), 230 deletions(-)
>
> v3: really just some line-wrapping, whitespace, and misc small nits
Hey Jens,
Given the blk-core changes, would you like to pick this up for 4.2 and
I'll just base DM ontop of linux-block.git again for the 4.2 cycle?
The patch is available here: https://patchwork.kernel.org/patch/6285081/
Mike
p.s. I do have a followup DM patch that cleans things up further but I
can just stage it via linux-dm.git:
https://patchwork.kernel.org/patch/6285091/
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2015-05-11 15:55 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-25 10:23 [PATCH v2] block, dm: don't copy bios for request clones Christoph Hellwig
2015-04-25 18:13 ` Hannes Reinecke
2015-04-26 13:35 ` Mike Snitzer
2015-04-26 14:20 ` Christoph Hellwig
2015-04-28 0:59 ` [PATCH v3] " Mike Snitzer
2015-04-28 1:03 ` [PATCH 2/1] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
2015-04-28 6:28 ` Christoph Hellwig
2015-04-28 10:22 ` Mike Snitzer
2015-04-28 5:49 ` [PATCH v3] block, dm: don't copy bios for request clones Hannes Reinecke
2015-04-28 6:29 ` Christoph Hellwig
2015-04-28 9:45 ` Mike Snitzer
2015-05-11 15:55 ` Mike Snitzer
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.