* [PATCH 00/13] Block cleanups (for bcache)
@ 2012-05-18  2:59 koverstreet
  2012-05-18  2:59 ` [PATCH 01/13] block: Generalized bio pool freeing koverstreet
                   ` (12 more replies)
  0 siblings, 13 replies; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
Couple related things in this patch series. This is mostly stuff I did for
bcache, polished/expanded up a bit:
 * Bio pool freeing. This moves freeing of bios allocated from bio pools into
   generic code.
 * Kill bi_destructor. That was Tejun's idea, but it turned out to be easier
   than I expected.
 * Improved bio splitting. This was originally part of bcache, but I pulled it
   out and replaced the existing bio splitting code with it.
 * Closures - this is from bcache. I didn't really need to use it for the next
   patch, but IMO it makes the code a bit more elegant.
 * Make generic_make_request() handle arbitrary size bios. I think this is
   going to enable a lot of cleanups in the future.
   The idea here isn't for generic_make_request() to be doing the splitting in
   practice long term, it's more just an intermediate stage. If this goes in, I
   think a lot of driver code - certainly a lot of virtual block drivers -
   could easily be made to handle arbitrary sized bios, and splitting will only
   happen when a bio is being redirected to two different devices or something
   like that. But this should enable a lot of cleanups without having to change
   every block driver first.
Kent Overstreet (13):
  block: Generalized bio pool freeing
  dm: kill dm_rq_bio_destructor
  block: Add bio_clone_bioset()
  block: Add bio_clone_kmalloc()
  block: Only clone bio vecs that are in use
  block: Add bio_reset()
  pktcdvd: Switch to bio_kmalloc()
  block: Kill bi_destructor
  block: Add an explicit bio flag for bios that own their bvec
  block: Rework bio splitting
  Closures
  Make generic_make_request handle arbitrarily large bios
  Gut bio_add_page()
 Documentation/block/biodoc.txt      |    5 -
 block/blk-core.c                    |  119 +++++++-
 drivers/block/drbd/drbd_req.c       |   18 +-
 drivers/block/osdblk.c              |    3 +-
 drivers/block/pktcdvd.c             |  121 +++-----
 drivers/block/rbd.c                 |   12 +-
 drivers/md/dm-crypt.c               |    9 -
 drivers/md/dm-io.c                  |   11 -
 drivers/md/dm.c                     |   48 +--
 drivers/md/linear.c                 |    6 +-
 drivers/md/md.c                     |   42 +---
 drivers/md/raid0.c                  |    8 +-
 drivers/md/raid10.c                 |   23 +-
 drivers/target/target_core_iblock.c |    9 -
 fs/bio-integrity.c                  |   44 ---
 fs/bio.c                            |  407 ++++++++++++-----------
 fs/exofs/ore.c                      |    5 +-
 include/linux/bio.h                 |   37 ++-
 include/linux/blk_types.h           |    9 +-
 include/linux/blkdev.h              |    3 +
 include/linux/closure.h             |  614 +++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug                   |    8 +
 lib/Makefile                        |    2 +-
 lib/closure.c                       |  363 +++++++++++++++++++++
 24 files changed, 1419 insertions(+), 507 deletions(-)
 create mode 100644 include/linux/closure.h
 create mode 100644 lib/closure.c
-- 
1.7.9.rc2
^ permalink raw reply	[flat|nested] 48+ messages in thread
* [PATCH 01/13] block: Generalized bio pool freeing
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
@ 2012-05-18  2:59 ` koverstreet
       [not found]   ` <ea774fea2a27c9f1028a12ce31a7ee5e5517bef4.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18  2:59 ` [PATCH 02/13] dm: kill dm_rq_bio_destructor koverstreet
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: Kent Overstreet, tj, axboe, agk, neilb
From: Kent Overstreet <koverstreet@google.com>
With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.
This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 drivers/md/dm-crypt.c               |    9 ---------
 drivers/md/dm-io.c                  |   11 -----------
 drivers/md/dm.c                     |   10 ----------
 drivers/md/md.c                     |   28 ++++------------------------
 drivers/target/target_core_iblock.c |    9 ---------
 fs/bio.c                            |   26 ++++++++------------------
 include/linux/blk_types.h           |    3 +++
 7 files changed, 15 insertions(+), 81 deletions(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 3f06df5..40716d8 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -808,14 +808,6 @@ static int crypt_convert(struct crypt_config *cc,
 	return 0;
 }
 
-static void dm_crypt_bio_destructor(struct bio *bio)
-{
-	struct dm_crypt_io *io = bio->bi_private;
-	struct crypt_config *cc = io->target->private;
-
-	bio_free(bio, cc->bs);
-}
-
 /*
  * Generate a new unfragmented bio with the given size
  * This should never violate the device limitations
@@ -985,7 +977,6 @@ static void clone_init(struct dm_crypt_io *io, struct bio *clone)
 	clone->bi_end_io  = crypt_endio;
 	clone->bi_bdev    = cc->dev->bdev;
 	clone->bi_rw      = io->base_bio->bi_rw;
-	clone->bi_destructor = dm_crypt_bio_destructor;
 }
 
 static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index ea5dd28..1c46f97 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -249,16 +249,6 @@ static void vm_dp_init(struct dpages *dp, void *data)
 	dp->context_ptr = data;
 }
 
-static void dm_bio_destructor(struct bio *bio)
-{
-	unsigned region;
-	struct io *io;
-
-	retrieve_io_and_region_from_bio(bio, &io, ®ion);
-
-	bio_free(bio, io->client->bios);
-}
-
 /*
  * Functions for getting the pages from kernel memory.
  */
@@ -317,7 +307,6 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
 		bio->bi_sector = where->sector + (where->count - remaining);
 		bio->bi_bdev = where->bdev;
 		bio->bi_end_io = endio;
-		bio->bi_destructor = dm_bio_destructor;
 		store_io_and_region_in_bio(bio, io, region);
 
 		if (rw & REQ_DISCARD) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e24143c..2114ec1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1036,13 +1036,6 @@ struct clone_info {
 	unsigned short idx;
 };
 
-static void dm_bio_destructor(struct bio *bio)
-{
-	struct bio_set *bs = bio->bi_private;
-
-	bio_free(bio, bs);
-}
-
 /*
  * Creates a little bio that just does part of a bvec.
  */
@@ -1054,7 +1047,6 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
 	struct bio_vec *bv = bio->bi_io_vec + idx;
 
 	clone = bio_alloc_bioset(GFP_NOIO, 1, bs);
-	clone->bi_destructor = dm_bio_destructor;
 	*clone->bi_io_vec = *bv;
 
 	clone->bi_sector = sector;
@@ -1086,7 +1078,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
 
 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
 	clone->bi_vcnt = idx + bv_count;
@@ -1131,7 +1122,6 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
 	 */
 	clone = bio_alloc_bioset(GFP_NOIO, ci->bio->bi_max_vecs, ci->md->bs);
 	__bio_clone(clone, ci->bio);
-	clone->bi_destructor = dm_bio_destructor;
 	if (len) {
 		clone->bi_sector = ci->sector;
 		clone->bi_size = to_bytes(len);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 01233d8..a5a524e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -155,32 +155,17 @@ static int start_readonly;
  * like bio_clone, but with a local bio set
  */
 
-static void mddev_bio_destructor(struct bio *bio)
-{
-	struct mddev *mddev, **mddevp;
-
-	mddevp = (void*)bio;
-	mddev = mddevp[-1];
-
-	bio_free(bio, mddev->bio_set);
-}
-
 struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 			    struct mddev *mddev)
 {
 	struct bio *b;
-	struct mddev **mddevp;
 
 	if (!mddev || !mddev->bio_set)
 		return bio_alloc(gfp_mask, nr_iovecs);
 
-	b = bio_alloc_bioset(gfp_mask, nr_iovecs,
-			     mddev->bio_set);
+	b = bio_alloc_bioset(gfp_mask, nr_iovecs, mddev->bio_set);
 	if (!b)
 		return NULL;
-	mddevp = (void*)b;
-	mddevp[-1] = mddev;
-	b->bi_destructor = mddev_bio_destructor;
 	return b;
 }
 EXPORT_SYMBOL_GPL(bio_alloc_mddev);
@@ -189,18 +174,14 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
 			    struct mddev *mddev)
 {
 	struct bio *b;
-	struct mddev **mddevp;
 
 	if (!mddev || !mddev->bio_set)
 		return bio_clone(bio, gfp_mask);
 
-	b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs,
-			     mddev->bio_set);
+	b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
 	if (!b)
 		return NULL;
-	mddevp = (void*)b;
-	mddevp[-1] = mddev;
-	b->bi_destructor = mddev_bio_destructor;
+
 	__bio_clone(b, bio);
 	if (bio_integrity(bio)) {
 		int ret;
@@ -4831,8 +4812,7 @@ int md_run(struct mddev *mddev)
 	}
 
 	if (mddev->bio_set == NULL)
-		mddev->bio_set = bioset_create(BIO_POOL_SIZE,
-					       sizeof(struct mddev *));
+		mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
 
 	spin_lock(&pers_lock);
 	pers = find_pers(mddev->level, mddev->clevel);
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 2ec299e..85b20c4 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -448,14 +448,6 @@ static ssize_t iblock_show_configfs_dev_params(
 	return bl;
 }
 
-static void iblock_bio_destructor(struct bio *bio)
-{
-	struct se_task *task = bio->bi_private;
-	struct iblock_dev *ib_dev = task->task_se_cmd->se_dev->dev_ptr;
-
-	bio_free(bio, ib_dev->ibd_bio_set);
-}
-
 static struct bio *
 iblock_get_bio(struct se_task *task, sector_t lba, u32 sg_num)
 {
@@ -482,7 +474,6 @@ iblock_get_bio(struct se_task *task, sector_t lba, u32 sg_num)
 
 	bio->bi_bdev = ib_dev->ibd_bd;
 	bio->bi_private = task;
-	bio->bi_destructor = iblock_bio_destructor;
 	bio->bi_end_io = &iblock_bio_done;
 	bio->bi_sector = lba;
 	atomic_inc(&ib_req->pending);
diff --git a/fs/bio.c b/fs/bio.c
index e453924..3667cef 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -269,10 +269,6 @@ EXPORT_SYMBOL(bio_init);
  *   bio_alloc_bioset will try its own mempool to satisfy the allocation.
  *   If %__GFP_WAIT is set then we will block on the internal pool waiting
  *   for a &struct bio to become free.
- *
- *   Note that the caller must set ->bi_destructor on successful return
- *   of a bio, to do the appropriate freeing of the bio once the reference
- *   count drops to zero.
  **/
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
@@ -287,6 +283,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 	bio = p + bs->front_pad;
 
 	bio_init(bio);
+	bio->bi_pool = bs;
 
 	if (unlikely(!nr_iovecs))
 		goto out_set;
@@ -313,11 +310,6 @@ err_free:
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
 
-static void bio_fs_destructor(struct bio *bio)
-{
-	bio_free(bio, fs_bio_set);
-}
-
 /**
  *	bio_alloc - allocate a new bio, memory pool backed
  *	@gfp_mask: allocation mask to use
@@ -339,12 +331,7 @@ static void bio_fs_destructor(struct bio *bio)
  */
 struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 {
-	struct bio *bio = bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
-
-	if (bio)
-		bio->bi_destructor = bio_fs_destructor;
-
-	return bio;
+	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
 }
 EXPORT_SYMBOL(bio_alloc);
 
@@ -419,7 +406,11 @@ void bio_put(struct bio *bio)
 	 */
 	if (atomic_dec_and_test(&bio->bi_cnt)) {
 		bio->bi_next = NULL;
-		bio->bi_destructor(bio);
+
+		if (bio->bi_pool)
+			bio_free(bio, bio->bi_pool);
+		else
+			bio->bi_destructor(bio);
 	}
 }
 EXPORT_SYMBOL(bio_put);
@@ -470,12 +461,11 @@ EXPORT_SYMBOL(__bio_clone);
  */
 struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, fs_bio_set);
+	struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);
 
 	if (!b)
 		return NULL;
 
-	b->bi_destructor = bio_fs_destructor;
 	__bio_clone(b, bio);
 
 	if (bio_integrity(bio)) {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4053cbd..dc0e399 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -70,6 +70,9 @@ struct bio {
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
 
+	/* If bi_pool is non NULL, bi_destructor is not called */
+	struct bio_set		*bi_pool;
+
 	bio_destructor_t	*bi_destructor;	/* destructor */
 
 	/*
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 02/13] dm: kill dm_rq_bio_destructor
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
  2012-05-18  2:59 ` [PATCH 01/13] block: Generalized bio pool freeing koverstreet
@ 2012-05-18  2:59 ` koverstreet
  2012-05-18 15:57   ` Tejun Heo
  2012-05-18  2:59 ` [PATCH 03/13] block: Add bio_clone_bioset() koverstreet
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: Kent Overstreet, tj, axboe, agk, neilb
From: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 drivers/md/dm.c |   11 +----------
 1 files changed, 1 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2114ec1..3cc2169 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -701,6 +701,7 @@ static void end_clone_bio(struct bio *clone, int error)
 	struct bio *bio = info->orig;
 	unsigned int nr_bytes = info->orig->bi_size;
 
+	free_bio_info(info);
 	bio_put(clone);
 
 	if (tio->error)
@@ -1448,15 +1449,6 @@ void dm_dispatch_request(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(dm_dispatch_request);
 
-static void dm_rq_bio_destructor(struct bio *bio)
-{
-	struct dm_rq_clone_bio_info *info = bio->bi_private;
-	struct mapped_device *md = info->tio->md;
-
-	free_bio_info(info);
-	bio_free(bio, md->bs);
-}
-
 static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
 				 void *data)
 {
@@ -1471,7 +1463,6 @@ static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
 	info->tio = tio;
 	bio->bi_end_io = end_clone_bio;
 	bio->bi_private = info;
-	bio->bi_destructor = dm_rq_bio_destructor;
 
 	return 0;
 }
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 03/13] block: Add bio_clone_bioset()
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
  2012-05-18  2:59 ` [PATCH 01/13] block: Generalized bio pool freeing koverstreet
  2012-05-18  2:59 ` [PATCH 02/13] dm: kill dm_rq_bio_destructor koverstreet
@ 2012-05-18  2:59 ` koverstreet
  2012-05-18 16:05   ` Tejun Heo
       [not found]   ` <eb6a7d3fe7ae203202bc365d7274ee531631a9ca.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18  2:59 ` [PATCH 04/13] block: Add bio_clone_kmalloc() koverstreet
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
This consolidates some code, and will help in a later patch changing how
bio cloning works.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 block/blk-core.c    |    8 +-------
 drivers/md/dm.c     |   27 ++++++++++-----------------
 drivers/md/md.c     |   18 +-----------------
 fs/bio.c            |   16 ++++++++++++----
 include/linux/bio.h |    1 +
 5 files changed, 25 insertions(+), 45 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 1f61b74..91617eb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2660,16 +2660,10 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 	blk_rq_init(NULL, rq);
 
 	__rq_for_each_bio(bio_src, rq_src) {
-		bio = bio_alloc_bioset(gfp_mask, bio_src->bi_max_vecs, bs);
+		bio = bio_clone_bioset(bio_src, gfp_mask, bs);
 		if (!bio)
 			goto free_and_out;
 
-		__bio_clone(bio, bio_src);
-
-		if (bio_integrity(bio_src) &&
-		    bio_integrity_clone(bio, bio_src, gfp_mask, bs))
-			goto free_and_out;
-
 		if (bio_ctr && bio_ctr(bio, bio_src, data))
 			goto free_and_out;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3cc2169..3e33039 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1072,26 +1072,19 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
  * Creates a bio that consists of range of complete bvecs.
  */
 static struct bio *clone_bio(struct bio *bio, sector_t sector,
-			     unsigned short idx, unsigned short bv_count,
+			     unsigned short bv_count,
 			     unsigned int len, struct bio_set *bs)
 {
 	struct bio *clone;
 
-	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
-	__bio_clone(clone, bio);
+	clone = bio_clone_bioset(bio, GFP_NOIO, bs);
 	clone->bi_sector = sector;
-	clone->bi_idx = idx;
-	clone->bi_vcnt = idx + bv_count;
+	clone->bi_vcnt = bv_count;
 	clone->bi_size = to_bytes(len);
-	clone->bi_flags &= ~(1 << BIO_SEG_VALID);
 
-	if (bio_integrity(bio)) {
-		bio_integrity_clone(clone, bio, GFP_NOIO, bs);
-
-		if (idx != bio->bi_idx || clone->bi_size < bio->bi_size)
-			bio_integrity_trim(clone,
-					   bio_sector_offset(bio, idx, 0), len);
-	}
+	if (bio_integrity(bio) &&
+	    clone->bi_size < bio->bi_size)
+		bio_integrity_trim(clone, 0, len);
 
 	return clone;
 }
@@ -1121,8 +1114,8 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
 	 * ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
 	 * and discard, so no need for concern about wasted bvec allocations.
 	 */
-	clone = bio_alloc_bioset(GFP_NOIO, ci->bio->bi_max_vecs, ci->md->bs);
-	__bio_clone(clone, ci->bio);
+	clone = bio_clone_bioset(ci->bio, GFP_NOIO, ci->md->bs);
+
 	if (len) {
 		clone->bi_sector = ci->sector;
 		clone->bi_size = to_bytes(len);
@@ -1161,7 +1154,7 @@ static void __clone_and_map_simple(struct clone_info *ci, struct dm_target *ti)
 	struct dm_target_io *tio;
 
 	tio = alloc_tio(ci, ti);
-	clone = clone_bio(bio, ci->sector, ci->idx,
+	clone = clone_bio(bio, ci->sector,
 			  bio->bi_vcnt - ci->idx, ci->sector_count,
 			  ci->md->bs);
 	__map_bio(ti, clone, tio);
@@ -1240,7 +1233,7 @@ static int __clone_and_map(struct clone_info *ci)
 		}
 
 		tio = alloc_tio(ci, ti);
-		clone = clone_bio(bio, ci->sector, ci->idx, i - ci->idx, len,
+		clone = clone_bio(bio, ci->sector, i - ci->idx, len,
 				  ci->md->bs);
 		__map_bio(ti, clone, tio);
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a5a524e..47605e7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -178,23 +178,7 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
 	if (!mddev || !mddev->bio_set)
 		return bio_clone(bio, gfp_mask);
 
-	b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
-	if (!b)
-		return NULL;
-
-	__bio_clone(b, bio);
-	if (bio_integrity(bio)) {
-		int ret;
-
-		ret = bio_integrity_clone(b, bio, gfp_mask, mddev->bio_set);
-
-		if (ret < 0) {
-			bio_put(b);
-			return NULL;
-		}
-	}
-
-	return b;
+	return bio_clone_bioset(gfp_mask, bio_segments(bio), mddev->bio_set);
 }
 EXPORT_SYMBOL_GPL(bio_clone_mddev);
 
diff --git a/fs/bio.c b/fs/bio.c
index 3667cef..4cb621f 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -453,15 +453,17 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
 EXPORT_SYMBOL(__bio_clone);
 
 /**
- *	bio_clone	-	clone a bio
+ *	bio_clone_bioset -	clone a bio
  *	@bio: bio to clone
  *	@gfp_mask: allocation priority
+ *	@bs: bio_set to allocate from
  *
  * 	Like __bio_clone, only also allocates the returned bio
  */
-struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
+			     struct bio_set *bs)
 {
-	struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);
+	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
 
 	if (!b)
 		return NULL;
@@ -471,7 +473,7 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
 	if (bio_integrity(bio)) {
 		int ret;
 
-		ret = bio_integrity_clone(b, bio, gfp_mask, fs_bio_set);
+		ret = bio_integrity_clone(b, bio, gfp_mask, bs);
 
 		if (ret < 0) {
 			bio_put(b);
@@ -481,6 +483,12 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
 
 	return b;
 }
+EXPORT_SYMBOL(bio_clone_bioset);
+
+struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
+{
+	return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
+}
 EXPORT_SYMBOL(bio_clone);
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 4d94eb8..e068a68 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -223,6 +223,7 @@ struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
 extern void __bio_clone(struct bio *, struct bio *);
+extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
 extern struct bio *bio_clone(struct bio *, gfp_t);
 
 extern void bio_init(struct bio *);
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 04/13] block: Add bio_clone_kmalloc()
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (2 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 03/13] block: Add bio_clone_bioset() koverstreet
@ 2012-05-18  2:59 ` koverstreet
       [not found]   ` <1c7c2d4b89bc3d0e907608cec37bcf0ee50f4c0e.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18 16:45   ` Boaz Harrosh
  2012-05-18  2:59 ` [PATCH 05/13] block: Only clone bio vecs that are in use koverstreet
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 drivers/block/osdblk.c |    3 +--
 drivers/block/rbd.c    |    8 ++------
 fs/bio.c               |   13 +++++++++++++
 fs/exofs/ore.c         |    5 ++---
 include/linux/bio.h    |    1 +
 5 files changed, 19 insertions(+), 11 deletions(-)
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 87311eb..1bbc681 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -266,11 +266,10 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
 	struct bio *tmp, *new_chain = NULL, *tail = NULL;
 
 	while (old_chain) {
-		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
+		tmp = bio_clone_kmalloc(old_chain, gfpmask);
 		if (!tmp)
 			goto err_out;
 
-		__bio_clone(tmp, old_chain);
 		tmp->bi_bdev = NULL;
 		gfpmask &= ~__GFP_WAIT;
 		tmp->bi_next = NULL;
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 013c7a5..5a953c6 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -729,7 +729,7 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 	}
 
 	while (old_chain && (total < len)) {
-		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
+		tmp = bio_clone_kmalloc(old_chain, gfpmask);
 		if (!tmp)
 			goto err_out;
 
@@ -751,13 +751,9 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 			if (!bp)
 				goto err_out;
 
-			__bio_clone(tmp, &bp->bio1);
-
 			*next = &bp->bio2;
-		} else {
-			__bio_clone(tmp, old_chain);
+		} else
 			*next = old_chain->bi_next;
-		}
 
 		tmp->bi_bdev = NULL;
 		gfpmask &= ~__GFP_WAIT;
diff --git a/fs/bio.c b/fs/bio.c
index 4cb621f..e2c0970 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -491,6 +491,19 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(bio_clone);
 
+struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
+{
+	struct bio *b = bio_kmalloc(gfp_mask, bio->bi_max_vecs);
+
+	if (!b)
+		return NULL;
+
+	__bio_clone(b, bio);
+
+	return b;
+}
+EXPORT_SYMBOL(bio_clone_kmalloc);
+
 /**
  *	bio_get_nr_vecs		- return approx number of vecs
  *	@bdev:  I/O target
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 49cf230..95c7264 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -820,8 +820,8 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
 			struct bio *bio;
 
 			if (per_dev != master_dev) {
-				bio = bio_kmalloc(GFP_KERNEL,
-						  master_dev->bio->bi_max_vecs);
+				bio = bio_clone_kmalloc(master_dev->bio,
+							GFP_KERNEL);
 				if (unlikely(!bio)) {
 					ORE_DBGMSG(
 					      "Failed to allocate BIO size=%u\n",
@@ -830,7 +830,6 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
 					goto out;
 				}
 
-				__bio_clone(bio, master_dev->bio);
 				bio->bi_bdev = NULL;
 				bio->bi_next = NULL;
 				per_dev->offset = master_dev->offset;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e068a68..b27f16b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -225,6 +225,7 @@ extern int bio_phys_segments(struct request_queue *, struct bio *);
 extern void __bio_clone(struct bio *, struct bio *);
 extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
 extern struct bio *bio_clone(struct bio *, gfp_t);
+struct bio *bio_clone_kmalloc(struct bio *, gfp_t);
 
 extern void bio_init(struct bio *);
 
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 05/13] block: Only clone bio vecs that are in use
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (3 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 04/13] block: Add bio_clone_kmalloc() koverstreet
@ 2012-05-18  2:59 ` koverstreet
  2012-05-18 16:13   ` Tejun Heo
  2012-05-18  2:59 ` [PATCH 06/13] block: Add bio_reset() koverstreet
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
bcache creates large bios internally, and then splits them according to
the device requirements before it sends them down. If a lower level
device tries to clone the bio, and the original bio had more than
BIO_MAX_PAGES, the clone will fail unecessarily.
We can fix this by only cloning the bio vecs that are actually in use.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/bio.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index e2c0970..de0733e 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -435,8 +435,9 @@ EXPORT_SYMBOL(bio_phys_segments);
  */
 void __bio_clone(struct bio *bio, struct bio *bio_src)
 {
-	memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
-		bio_src->bi_max_vecs * sizeof(struct bio_vec));
+	memcpy(bio->bi_io_vec,
+	       bio_iovec(bio_src),
+	       bio_segments(bio_src) * sizeof(struct bio_vec));
 
 	/*
 	 * most users will be overriding ->bi_bdev with a new target,
@@ -445,10 +446,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
 	bio->bi_sector = bio_src->bi_sector;
 	bio->bi_bdev = bio_src->bi_bdev;
 	bio->bi_flags |= 1 << BIO_CLONED;
+	bio->bi_flags &= ~(1 << BIO_SEG_VALID);
 	bio->bi_rw = bio_src->bi_rw;
-	bio->bi_vcnt = bio_src->bi_vcnt;
+	bio->bi_vcnt = bio_segments(bio_src);
 	bio->bi_size = bio_src->bi_size;
-	bio->bi_idx = bio_src->bi_idx;
 }
 EXPORT_SYMBOL(__bio_clone);
 
@@ -463,7 +464,7 @@ EXPORT_SYMBOL(__bio_clone);
 struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
 			     struct bio_set *bs)
 {
-	struct bio *b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, bs);
+	struct bio *b = bio_alloc_bioset(gfp_mask, bio_segments(bio), bs);
 
 	if (!b)
 		return NULL;
@@ -493,7 +494,7 @@ EXPORT_SYMBOL(bio_clone);
 
 struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
 {
-	struct bio *b = bio_kmalloc(gfp_mask, bio->bi_max_vecs);
+	struct bio *b = bio_kmalloc(gfp_mask, bio_segments(bio));
 
 	if (!b)
 		return NULL;
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 06/13] block: Add bio_reset()
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (4 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 05/13] block: Only clone bio vecs that are in use koverstreet
@ 2012-05-18  2:59 ` koverstreet
  2012-05-18 16:16   ` Tejun Heo
  2012-05-18  2:59 ` [PATCH 07/13] pktcdvd: Switch to bio_kmalloc() koverstreet
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
Reusing bios is something that's been highly frowned upon in the past,
but driver code keeps doing it anyways. If it's going to happen anyways,
we should provide a generic method.
This'll help with getting rid of bi_destructor.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/bio.c                  |    8 ++++++++
 include/linux/bio.h       |    1 +
 include/linux/blk_types.h |    6 ++++++
 3 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index de0733e..90e4c3a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -259,6 +259,14 @@ void bio_init(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_init);
 
+void bio_reset(struct bio *bio)
+{
+	memset(bio, 0, BIO_RESET_OFFSET);
+	bio->bi_flags = 1 << BIO_UPTODATE;
+
+}
+EXPORT_SYMBOL(bio_reset);
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b27f16b..35f7c4d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -228,6 +228,7 @@ extern struct bio *bio_clone(struct bio *, gfp_t);
 struct bio *bio_clone_kmalloc(struct bio *, gfp_t);
 
 extern void bio_init(struct bio *);
+extern void bio_reset(struct bio *);
 
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dc0e399..4a47783 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -57,6 +57,10 @@ struct bio {
 	unsigned int		bi_seg_front_size;
 	unsigned int		bi_seg_back_size;
 
+	/*
+	 * Everything starting with bi_max_vecs will be preserved by bio_reset()
+	 */
+
 	unsigned int		bi_max_vecs;	/* max bvl_vecs we can hold */
 
 	atomic_t		bi_cnt;		/* pin count */
@@ -83,6 +87,8 @@ struct bio {
 	struct bio_vec		bi_inline_vecs[0];
 };
 
+#define BIO_RESET_OFFSET	offsetof(struct bio, bi_max_vecs)
+
 /*
  * bio flags
  */
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 07/13] pktcdvd: Switch to bio_kmalloc()
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (5 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 06/13] block: Add bio_reset() koverstreet
@ 2012-05-18  2:59 ` koverstreet
       [not found]   ` <04a4d6c2c8b6f0097e3594c6e0932093afffc1da.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18  2:59 ` [PATCH 08/13] block: Kill bi_destructor koverstreet
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
This is prep work for killing bi_destructor
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 drivers/block/pktcdvd.c |  115 ++++++++++++++++-------------------------------
 1 files changed, 39 insertions(+), 76 deletions(-)
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index ba66e44..6fe693a 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -522,36 +522,38 @@ static void pkt_bio_finished(struct pktcdvd_device *pd)
 	}
 }
 
-static void pkt_bio_destructor(struct bio *bio)
+static void pkt_end_io_read(struct bio *bio, int err)
 {
-	kfree(bio->bi_io_vec);
-	kfree(bio);
-}
+	struct packet_data *pkt = bio->bi_private;
+	struct pktcdvd_device *pd = pkt->pd;
+	BUG_ON(!pd);
 
-static struct bio *pkt_bio_alloc(int nr_iovecs)
-{
-	struct bio_vec *bvl = NULL;
-	struct bio *bio;
+	VPRINTK("pkt_end_io_read: bio=%p sec0=%llx sec=%llx err=%d\n", bio,
+		(unsigned long long)pkt->sector, (unsigned long long)bio->bi_sector, err);
 
-	bio = kmalloc(sizeof(struct bio), GFP_KERNEL);
-	if (!bio)
-		goto no_bio;
-	bio_init(bio);
+	if (err)
+		atomic_inc(&pkt->io_errors);
+	if (atomic_dec_and_test(&pkt->io_wait)) {
+		atomic_inc(&pkt->run_sm);
+		wake_up(&pd->wqueue);
+	}
+	pkt_bio_finished(pd);
+}
 
-	bvl = kcalloc(nr_iovecs, sizeof(struct bio_vec), GFP_KERNEL);
-	if (!bvl)
-		goto no_bvl;
+static void pkt_end_io_packet_write(struct bio *bio, int err)
+{
+	struct packet_data *pkt = bio->bi_private;
+	struct pktcdvd_device *pd = pkt->pd;
+	BUG_ON(!pd);
 
-	bio->bi_max_vecs = nr_iovecs;
-	bio->bi_io_vec = bvl;
-	bio->bi_destructor = pkt_bio_destructor;
+	VPRINTK("pkt_end_io_packet_write: id=%d, err=%d\n", pkt->id, err);
 
-	return bio;
+	pd->stats.pkt_ended++;
 
- no_bvl:
-	kfree(bio);
- no_bio:
-	return NULL;
+	pkt_bio_finished(pd);
+	atomic_dec(&pkt->io_wait);
+	atomic_inc(&pkt->run_sm);
+	wake_up(&pd->wqueue);
 }
 
 /*
@@ -567,10 +569,13 @@ static struct packet_data *pkt_alloc_packet_data(int frames)
 		goto no_pkt;
 
 	pkt->frames = frames;
-	pkt->w_bio = pkt_bio_alloc(frames);
+	pkt->w_bio = bio_kmalloc(GFP_KERNEL, frames);
 	if (!pkt->w_bio)
 		goto no_bio;
 
+	pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
+	pkt->w_bio->bi_private = pkt;
+
 	for (i = 0; i < frames / FRAMES_PER_PAGE; i++) {
 		pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!pkt->pages[i])
@@ -581,9 +586,12 @@ static struct packet_data *pkt_alloc_packet_data(int frames)
 	bio_list_init(&pkt->orig_bios);
 
 	for (i = 0; i < frames; i++) {
-		struct bio *bio = pkt_bio_alloc(1);
+		struct bio *bio = bio_kmalloc(GFP_KERNEL, 1);
 		if (!bio)
 			goto no_rd_bio;
+
+		bio->bi_end_io = pkt_end_io_read;
+		bio->bi_private = pkt;
 		pkt->r_bios[i] = bio;
 	}
 
@@ -1036,40 +1044,6 @@ static void pkt_make_local_copy(struct packet_data *pkt, struct bio_vec *bvec)
 	}
 }
 
-static void pkt_end_io_read(struct bio *bio, int err)
-{
-	struct packet_data *pkt = bio->bi_private;
-	struct pktcdvd_device *pd = pkt->pd;
-	BUG_ON(!pd);
-
-	VPRINTK("pkt_end_io_read: bio=%p sec0=%llx sec=%llx err=%d\n", bio,
-		(unsigned long long)pkt->sector, (unsigned long long)bio->bi_sector, err);
-
-	if (err)
-		atomic_inc(&pkt->io_errors);
-	if (atomic_dec_and_test(&pkt->io_wait)) {
-		atomic_inc(&pkt->run_sm);
-		wake_up(&pd->wqueue);
-	}
-	pkt_bio_finished(pd);
-}
-
-static void pkt_end_io_packet_write(struct bio *bio, int err)
-{
-	struct packet_data *pkt = bio->bi_private;
-	struct pktcdvd_device *pd = pkt->pd;
-	BUG_ON(!pd);
-
-	VPRINTK("pkt_end_io_packet_write: id=%d, err=%d\n", pkt->id, err);
-
-	pd->stats.pkt_ended++;
-
-	pkt_bio_finished(pd);
-	atomic_dec(&pkt->io_wait);
-	atomic_inc(&pkt->run_sm);
-	wake_up(&pd->wqueue);
-}
-
 /*
  * Schedule reads for the holes in a packet
  */
@@ -1111,21 +1085,15 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
 	 * Schedule reads for missing parts of the packet.
 	 */
 	for (f = 0; f < pkt->frames; f++) {
-		struct bio_vec *vec;
-
 		int p, offset;
+
 		if (written[f])
 			continue;
+
 		bio = pkt->r_bios[f];
-		vec = bio->bi_io_vec;
-		bio_init(bio);
-		bio->bi_max_vecs = 1;
-		bio->bi_sector = pkt->sector + f * (CD_FRAMESIZE >> 9);
-		bio->bi_bdev = pd->bdev;
-		bio->bi_end_io = pkt_end_io_read;
-		bio->bi_private = pkt;
-		bio->bi_io_vec = vec;
-		bio->bi_destructor = pkt_bio_destructor;
+		bio_reset(bio);
+		bio->bi_sector	= pkt->sector + f * (CD_FRAMESIZE >> 9);
+		bio->bi_bdev	= pd->bdev;
 
 		p = (f * CD_FRAMESIZE) / PAGE_SIZE;
 		offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
@@ -1418,14 +1386,9 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
 	}
 
 	/* Start the write request */
-	bio_init(pkt->w_bio);
-	pkt->w_bio->bi_max_vecs = PACKET_MAX_SIZE;
+	bio_reset(pkt->w_bio);
 	pkt->w_bio->bi_sector = pkt->sector;
 	pkt->w_bio->bi_bdev = pd->bdev;
-	pkt->w_bio->bi_end_io = pkt_end_io_packet_write;
-	pkt->w_bio->bi_private = pkt;
-	pkt->w_bio->bi_io_vec = bvec;
-	pkt->w_bio->bi_destructor = pkt_bio_destructor;
 	for (f = 0; f < pkt->frames; f++)
 		if (!bio_add_page(pkt->w_bio, bvec[f].bv_page, CD_FRAMESIZE, bvec[f].bv_offset))
 			BUG();
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 08/13] block: Kill bi_destructor
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (6 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 07/13] pktcdvd: Switch to bio_kmalloc() koverstreet
@ 2012-05-18  2:59 ` koverstreet
       [not found]   ` <de05855cf0fa4800cfab7e8340f106dccc7a75a1.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18  2:59 ` [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec koverstreet
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: tj, axboe, Kent Overstreet, agk
From: Kent Overstreet <koverstreet@google.com>
Now that we've got generic code for freeing bios allocated from bio
pools, this isn't needed anymore.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 Documentation/block/biodoc.txt |    5 -----
 fs/bio.c                       |   15 +++++----------
 include/linux/blk_types.h      |    3 ---
 3 files changed, 5 insertions(+), 18 deletions(-)
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index e418dc0..8df5e8e 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -465,7 +465,6 @@ struct bio {
        bio_end_io_t	*bi_end_io;  /* bi_end_io (bio) */
        atomic_t		bi_cnt;	     /* pin count: free when it hits zero */
        void             *bi_private;
-       bio_destructor_t *bi_destructor; /* bi_destructor (bio) */
 };
 
 With this multipage bio design:
@@ -647,10 +646,6 @@ for a non-clone bio. There are the 6 pools setup for different size biovecs,
 so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
 given size from these slabs.
 
-The bi_destructor() routine takes into account the possibility of the bio
-having originated from a different source (see later discussions on
-n/w to block transfers and kvec_cb)
-
 The bio_get() routine may be used to hold an extra reference on a bio prior
 to i/o submission, if the bio fields are likely to be accessed after the
 i/o is issued (since the bio may otherwise get freed in case i/o completion
diff --git a/fs/bio.c b/fs/bio.c
index 90e4c3a..ecc9088 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -343,13 +343,6 @@ struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 }
 EXPORT_SYMBOL(bio_alloc);
 
-static void bio_kmalloc_destructor(struct bio *bio)
-{
-	if (bio_integrity(bio))
-		bio_integrity_free(bio, fs_bio_set);
-	kfree(bio);
-}
-
 /**
  * bio_kmalloc - allocate a bio for I/O using kmalloc()
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -376,7 +369,6 @@ struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 	bio->bi_flags |= BIO_POOL_NONE << BIO_POOL_OFFSET;
 	bio->bi_max_vecs = nr_iovecs;
 	bio->bi_io_vec = bio->bi_inline_vecs;
-	bio->bi_destructor = bio_kmalloc_destructor;
 
 	return bio;
 }
@@ -417,8 +409,11 @@ void bio_put(struct bio *bio)
 
 		if (bio->bi_pool)
 			bio_free(bio, bio->bi_pool);
-		else
-			bio->bi_destructor(bio);
+		else {
+			if (bio_integrity(bio))
+				bio_integrity_free(bio, fs_bio_set);
+			kfree(bio);
+		}
 	}
 }
 EXPORT_SYMBOL(bio_put);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4a47783..bd9a610 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,11 +74,8 @@ struct bio {
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
 
-	/* If bi_pool is non NULL, bi_destructor is not called */
 	struct bio_set		*bi_pool;
 
-	bio_destructor_t	*bi_destructor;	/* destructor */
-
 	/*
 	 * We can inline a number of vecs at the end of the bio, to avoid
 	 * double allocations for a small number of bio_vecs. This member
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (7 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 08/13] block: Kill bi_destructor koverstreet
@ 2012-05-18  2:59 ` koverstreet
       [not found]   ` <363875943e9d0e13bee6ed28239280543e6e5055.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18  2:59 ` [PATCH 10/13] block: Rework bio splitting koverstreet
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: Kent Overstreet, tj, axboe, agk, neilb
From: Kent Overstreet <koverstreet@google.com>
This is for the new bio splitting code. When we split a bio, if the
split occured on a bvec boundry we reuse the bvec for the new bio. But
that means bio_free() can't free it, hence the explicit flag.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/bio.c                  |    3 ++-
 include/linux/blk_types.h |    1 +
 2 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index ecc9088..3332800 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -234,7 +234,7 @@ void bio_free(struct bio *bio, struct bio_set *bs)
 {
 	void *p;
 
-	if (bio_has_allocated_vec(bio))
+	if (bio_flagged(bio, BIO_HAS_VEC))
 		bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));
 
 	if (bio_integrity(bio))
@@ -305,6 +305,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 			goto err_free;
 
 		nr_iovecs = bvec_nr_vecs(idx);
+		bio->bi_flags |= 1 << BIO_HAS_VEC;
 	}
 out_set:
 	bio->bi_flags |= idx << BIO_POOL_OFFSET;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bd9a610..10fca21 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -101,6 +101,7 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_HAS_VEC	12	/* bio_free() should free bvec */
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
 /*
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 10/13] block: Rework bio splitting
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (8 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec koverstreet
@ 2012-05-18  2:59 ` koverstreet
       [not found]   ` <ac4c1cbd10934fdc3a4af74d4cb5ec370f9139d5.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18 17:46   ` Boaz Harrosh
       [not found] ` <cover.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: Kent Overstreet, tj, axboe, agk, neilb
From: Kent Overstreet <koverstreet@google.com>
Break out bio_split() and bio_pair_split(), and get rid of the
limitations of bio_split()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 drivers/block/drbd/drbd_req.c |   18 +----
 drivers/block/pktcdvd.c       |    6 +-
 drivers/block/rbd.c           |    4 +-
 drivers/md/linear.c           |    6 +-
 drivers/md/raid0.c            |    8 +-
 drivers/md/raid10.c           |   23 ++-----
 fs/bio-integrity.c            |   44 ------------
 fs/bio.c                      |  149 ++++++++++++++++++++++++++---------------
 include/linux/bio.h           |   27 +++-----
 9 files changed, 125 insertions(+), 160 deletions(-)
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 4a0f314..68fa494 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1101,18 +1101,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
 	if (likely(s_enr == e_enr)) {
 		inc_ap_bio(mdev, 1);
 		drbd_make_request_common(mdev, bio, start_time);
-		return;
-	}
-
-	/* can this bio be split generically?
-	 * Maybe add our own split-arbitrary-bios function. */
-	if (bio->bi_vcnt != 1 || bio->bi_idx != 0 || bio->bi_size > DRBD_MAX_BIO_SIZE) {
-		/* rather error out here than BUG in bio_split */
-		dev_err(DEV, "bio would need to, but cannot, be split: "
-		    "(vcnt=%u,idx=%u,size=%u,sector=%llu)\n",
-		    bio->bi_vcnt, bio->bi_idx, bio->bi_size,
-		    (unsigned long long)bio->bi_sector);
-		bio_endio(bio, -EINVAL);
 	} else {
 		/* This bio crosses some boundary, so we have to split it. */
 		struct bio_pair *bp;
@@ -1128,7 +1116,7 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
 		const int sps = 1 << HT_SHIFT; /* sectors per slot */
 		const int mask = sps - 1;
 		const sector_t first_sectors = sps - (sect & mask);
-		bp = bio_split(bio, first_sectors);
+		bp = bio_pair_split(bio, first_sectors);
 
 		/* we need to get a "reference count" (ap_bio_cnt)
 		 * to avoid races with the disconnect/reconnect/suspend code.
@@ -1139,10 +1127,10 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
 
 		D_ASSERT(e_enr == s_enr + 1);
 
-		while (drbd_make_request_common(mdev, &bp->bio1, start_time))
+		while (drbd_make_request_common(mdev, &bp->split, start_time))
 			inc_ap_bio(mdev, 1);
 
-		while (drbd_make_request_common(mdev, &bp->bio2, start_time))
+		while (drbd_make_request_common(mdev, bio, start_time))
 			inc_ap_bio(mdev, 1);
 
 		dec_ap_bio(mdev);
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 6fe693a..1465155 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2467,10 +2467,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
 		if (last_zone != zone) {
 			BUG_ON(last_zone != zone + pd->settings.size);
 			first_sectors = last_zone - bio->bi_sector;
-			bp = bio_split(bio, first_sectors);
+			bp = bio_pair_split(bio, first_sectors);
 			BUG_ON(!bp);
-			pkt_make_request(q, &bp->bio1);
-			pkt_make_request(q, &bp->bio2);
+			pkt_make_request(q, &bp->split);
+			pkt_make_request(q, bio);
 			bio_pair_release(bp);
 			return;
 		}
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5a953c6..dd19311 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -747,11 +747,11 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 
 			/* split the bio. We'll release it either in the next
 			   call, or it will have to be released outside */
-			bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
+			bp = bio_pair_split(old_chain, (len - total) / SECTOR_SIZE);
 			if (!bp)
 				goto err_out;
 
-			*next = &bp->bio2;
+			*next = &bp->split;
 		} else
 			*next = old_chain->bi_next;
 
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index fa211d8..7c6cafd 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -314,10 +314,10 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 
 		rcu_read_unlock();
 
-		bp = bio_split(bio, end_sector - bio->bi_sector);
+		bp = bio_pair_split(bio, end_sector - bio->bi_sector);
 
-		linear_make_request(mddev, &bp->bio1);
-		linear_make_request(mddev, &bp->bio2);
+		linear_make_request(mddev, &bp->split);
+		linear_make_request(mddev, bio);
 		bio_pair_release(bp);
 		return;
 	}
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index de63a1f..3469adf 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -516,13 +516,13 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 		 * refuse to split for us, so we need to split it.
 		 */
 		if (likely(is_power_of_2(chunk_sects)))
-			bp = bio_split(bio, chunk_sects - (sector &
+			bp = bio_pair_split(bio, chunk_sects - (sector &
 							   (chunk_sects-1)));
 		else
-			bp = bio_split(bio, chunk_sects -
+			bp = bio_pair_split(bio, chunk_sects -
 				       sector_div(sector, chunk_sects));
-		raid0_make_request(mddev, &bp->bio1);
-		raid0_make_request(mddev, &bp->bio2);
+		raid0_make_request(mddev, &bp->split);
+		raid0_make_request(mddev, bio);
 		bio_pair_release(bp);
 		return;
 	}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3e7b154..0062326 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1001,15 +1001,9 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 		      > chunk_sects &&
 		    conf->near_copies < conf->raid_disks)) {
 		struct bio_pair *bp;
-		/* Sanity check -- queue functions should prevent this happening */
-		if (bio->bi_vcnt != 1 ||
-		    bio->bi_idx != 0)
-			goto bad_map;
-		/* This is a one page bio that upper layers
-		 * refuse to split for us, so we need to split it.
-		 */
-		bp = bio_split(bio,
-			       chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
+
+		bp = bio_pair_split(bio,
+				    chunk_sects - (bio->bi_sector & (chunk_sects - 1)));
 
 		/* Each of these 'make_request' calls will call 'wait_barrier'.
 		 * If the first succeeds but the second blocks due to the resync
@@ -1023,8 +1017,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 		conf->nr_waiting++;
 		spin_unlock_irq(&conf->resync_lock);
 
-		make_request(mddev, &bp->bio1);
-		make_request(mddev, &bp->bio2);
+		make_request(mddev, &bp->split);
+		make_request(mddev, bio);
 
 		spin_lock_irq(&conf->resync_lock);
 		conf->nr_waiting--;
@@ -1033,13 +1027,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 
 		bio_pair_release(bp);
 		return;
-	bad_map:
-		printk("md/raid10:%s: make_request bug: can't convert block across chunks"
-		       " or bigger than %dk %llu %d\n", mdname(mddev), chunk_sects/2,
-		       (unsigned long long)bio->bi_sector, bio->bi_size >> 10);
-
-		bio_io_error(bio);
-		return;
 	}
 
 	md_write_start(mddev, bio);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index e85c04b..9ed2c44 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -682,50 +682,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
 EXPORT_SYMBOL(bio_integrity_trim);
 
 /**
- * bio_integrity_split - Split integrity metadata
- * @bio:	Protected bio
- * @bp:		Resulting bio_pair
- * @sectors:	Offset
- *
- * Description: Splits an integrity page into a bio_pair.
- */
-void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
-{
-	struct blk_integrity *bi;
-	struct bio_integrity_payload *bip = bio->bi_integrity;
-	unsigned int nr_sectors;
-
-	if (bio_integrity(bio) == 0)
-		return;
-
-	bi = bdev_get_integrity(bio->bi_bdev);
-	BUG_ON(bi == NULL);
-	BUG_ON(bip->bip_vcnt != 1);
-
-	nr_sectors = bio_integrity_hw_sectors(bi, sectors);
-
-	bp->bio1.bi_integrity = &bp->bip1;
-	bp->bio2.bi_integrity = &bp->bip2;
-
-	bp->iv1 = bip->bip_vec[0];
-	bp->iv2 = bip->bip_vec[0];
-
-	bp->bip1.bip_vec[0] = bp->iv1;
-	bp->bip2.bip_vec[0] = bp->iv2;
-
-	bp->iv1.bv_len = sectors * bi->tuple_size;
-	bp->iv2.bv_offset += sectors * bi->tuple_size;
-	bp->iv2.bv_len -= sectors * bi->tuple_size;
-
-	bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
-	bp->bip2.bip_sector = bio->bi_integrity->bip_sector + nr_sectors;
-
-	bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
-	bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
-}
-EXPORT_SYMBOL(bio_integrity_split);
-
-/**
  * bio_integrity_clone - Callback for cloning bios with integrity metadata
  * @bio:	New bio
  * @bio_src:	Original bio
diff --git a/fs/bio.c b/fs/bio.c
index 3332800..781a8cf 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -35,7 +35,7 @@
  */
 #define BIO_INLINE_VECS		4
 
-static mempool_t *bio_split_pool __read_mostly;
+static struct bio_set *bio_split_pool __read_mostly;
 
 /*
  * if you change this list, also change bvec_alloc or things will
@@ -1464,84 +1464,124 @@ void bio_endio(struct bio *bio, int error)
 }
 EXPORT_SYMBOL(bio_endio);
 
-void bio_pair_release(struct bio_pair *bp)
+struct bio *bio_split(struct bio *bio, int sectors,
+		      gfp_t gfp, struct bio_set *bs)
 {
-	if (atomic_dec_and_test(&bp->cnt)) {
-		struct bio *master = bp->bio1.bi_private;
+	unsigned idx, vcnt = 0, nbytes = sectors << 9;
+	struct bio_vec *bv;
+	struct bio *ret = NULL;
+
+	BUG_ON(sectors <= 0);
+
+	if (current->bio_list)
+		gfp &= ~__GFP_WAIT;
+
+	if (nbytes >= bio->bi_size)
+		return bio;
+
+	trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
+				bio->bi_sector + sectors);
+
+	bio_for_each_segment(bv, bio, idx) {
+		vcnt = idx - bio->bi_idx;
+
+		if (!nbytes) {
+			ret = bio_alloc_bioset(gfp, 0, bs);
+			if (!ret)
+				return NULL;
+
+			ret->bi_io_vec = bio_iovec(bio);
+			ret->bi_flags |= 1 << BIO_CLONED;
+			break;
+		} else if (nbytes < bv->bv_len) {
+			ret = bio_alloc_bioset(gfp, ++vcnt, bs);
+			if (!ret)
+				return NULL;
+
+			memcpy(ret->bi_io_vec, bio_iovec(bio),
+			       sizeof(struct bio_vec) * vcnt);
+
+			ret->bi_io_vec[vcnt - 1].bv_len = nbytes;
+			bv->bv_offset	+= nbytes;
+			bv->bv_len	-= nbytes;
+			break;
+		}
+
+		nbytes -= bv->bv_len;
+	}
+
+	ret->bi_bdev	= bio->bi_bdev;
+	ret->bi_sector	= bio->bi_sector;
+	ret->bi_size	= sectors << 9;
+	ret->bi_rw	= bio->bi_rw;
+	ret->bi_vcnt	= vcnt;
+	ret->bi_max_vecs = vcnt;
+	ret->bi_end_io	= bio->bi_end_io;
+	ret->bi_private	= bio->bi_private;
 
-		bio_endio(master, bp->error);
-		mempool_free(bp, bp->bio2.bi_private);
+	bio->bi_sector	+= sectors;
+	bio->bi_size	-= sectors << 9;
+	bio->bi_idx	 = idx;
+
+	if (bio_integrity(bio)) {
+		bio_integrity_clone(ret, bio, gfp, bs);
+		bio_integrity_trim(ret, 0, bio_sectors(ret));
+		bio_integrity_trim(bio, bio_sectors(ret), bio_sectors(bio));
 	}
+
+	return ret;
 }
-EXPORT_SYMBOL(bio_pair_release);
+EXPORT_SYMBOL_GPL(bio_split);
 
-static void bio_pair_end_1(struct bio *bi, int err)
+void bio_pair_release(struct bio_pair *bp)
 {
-	struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
-
-	if (err)
-		bp->error = err;
+	if (atomic_dec_and_test(&bp->cnt)) {
+		bp->orig->bi_end_io = bp->bi_end_io;
+		bp->orig->bi_private = bp->bi_private;
 
-	bio_pair_release(bp);
+		bio_endio(bp->orig, 0);
+		bio_put(&bp->split);
+	}
 }
+EXPORT_SYMBOL(bio_pair_release);
 
-static void bio_pair_end_2(struct bio *bi, int err)
+static void bio_pair_end(struct bio *bio, int error)
 {
-	struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);
+	struct bio_pair *bp = bio->bi_private;
 
-	if (err)
-		bp->error = err;
+	if (error)
+		clear_bit(BIO_UPTODATE, &bp->orig->bi_flags);
 
 	bio_pair_release(bp);
 }
 
-/*
- * split a bio - only worry about a bio with a single page in its iovec
- */
-struct bio_pair *bio_split(struct bio *bi, int first_sectors)
+struct bio_pair *bio_pair_split(struct bio *bio, int first_sectors)
 {
-	struct bio_pair *bp = mempool_alloc(bio_split_pool, GFP_NOIO);
-
-	if (!bp)
-		return bp;
+	struct bio_pair *bp;
+	struct bio *split = bio_split(bio, first_sectors,
+				      GFP_NOIO, bio_split_pool);
 
-	trace_block_split(bdev_get_queue(bi->bi_bdev), bi,
-				bi->bi_sector + first_sectors);
-
-	BUG_ON(bi->bi_vcnt != 1);
-	BUG_ON(bi->bi_idx != 0);
-	atomic_set(&bp->cnt, 3);
-	bp->error = 0;
-	bp->bio1 = *bi;
-	bp->bio2 = *bi;
-	bp->bio2.bi_sector += first_sectors;
-	bp->bio2.bi_size -= first_sectors << 9;
-	bp->bio1.bi_size = first_sectors << 9;
+	if (!split)
+		return NULL;
 
-	bp->bv1 = bi->bi_io_vec[0];
-	bp->bv2 = bi->bi_io_vec[0];
-	bp->bv2.bv_offset += first_sectors << 9;
-	bp->bv2.bv_len -= first_sectors << 9;
-	bp->bv1.bv_len = first_sectors << 9;
+	BUG_ON(split == bio);
 
-	bp->bio1.bi_io_vec = &bp->bv1;
-	bp->bio2.bi_io_vec = &bp->bv2;
+	bp = container_of(split, struct bio_pair, split);
 
-	bp->bio1.bi_max_vecs = 1;
-	bp->bio2.bi_max_vecs = 1;
+	atomic_set(&bp->cnt, 3);
 
-	bp->bio1.bi_end_io = bio_pair_end_1;
-	bp->bio2.bi_end_io = bio_pair_end_2;
+	bp->bi_end_io = bio->bi_end_io;
+	bp->bi_private = bio->bi_private;
 
-	bp->bio1.bi_private = bi;
-	bp->bio2.bi_private = bio_split_pool;
+	bio->bi_private = bp;
+	bio->bi_end_io = bio_pair_end;
 
-	if (bio_integrity(bi))
-		bio_integrity_split(bi, bp, first_sectors);
+	split->bi_private = bp;
+	split->bi_end_io = bio_pair_end;
 
 	return bp;
 }
-EXPORT_SYMBOL(bio_split);
+EXPORT_SYMBOL(bio_pair_split);
 
 /**
  *      bio_sector_offset - Find hardware sector offset in bio
@@ -1694,8 +1734,7 @@ static int __init init_bio(void)
 	if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
 		panic("bio: can't create integrity pool\n");
 
-	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
-						     sizeof(struct bio_pair));
+	bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
 	if (!bio_split_pool)
 		panic("bio: can't create split pool\n");
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 35f7c4d..db38175 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -197,16 +197,18 @@ struct bio_integrity_payload {
  *   in bio2.bi_private
  */
 struct bio_pair {
-	struct bio			bio1, bio2;
-	struct bio_vec			bv1, bv2;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
-	struct bio_integrity_payload	bip1, bip2;
-	struct bio_vec			iv1, iv2;
-#endif
-	atomic_t			cnt;
-	int				error;
+	atomic_t		cnt;
+
+	bio_end_io_t		*bi_end_io;
+	void			*bi_private;
+
+	struct bio		*orig;
+	struct bio		split;
 };
-extern struct bio_pair *bio_split(struct bio *bi, int first_sectors);
+
+extern struct bio *bio_split(struct bio *bio, int sectors,
+			     gfp_t gfp, struct bio_set *bs);
+extern struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors);
 extern void bio_pair_release(struct bio_pair *dbio);
 
 extern struct bio_set *bioset_create(unsigned int, unsigned int);
@@ -511,7 +513,6 @@ extern int bio_integrity_prep(struct bio *);
 extern void bio_integrity_endio(struct bio *, int);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
-extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t, struct bio_set *);
 extern int bioset_integrity_create(struct bio_set *, int);
 extern void bioset_integrity_free(struct bio_set *);
@@ -555,12 +556,6 @@ static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 	return 0;
 }
 
-static inline void bio_integrity_split(struct bio *bio, struct bio_pair *bp,
-				       int sectors)
-{
-	return;
-}
-
 static inline void bio_integrity_advance(struct bio *bio,
 					 unsigned int bytes_done)
 {
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 11/13] Closures
       [not found] ` <cover.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18  2:59   ` koverstreet-hpIqsD4AKlfQT0dZR+AlfA
       [not found]     ` <a184989cfbf92297d4cca2f823e5ac24ec8fe1e3.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: koverstreet-hpIqsD4AKlfQT0dZR+AlfA @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: Kent Overstreet, tj-DgEjT+Ai2ygdnm+yROfE0A,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Asynchronous refcounty thingies; they embed a refcount and a work
struct. Extensive documentation follows in include/linux/closure.h
Signed-off-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/closure.h |  614 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug       |    8 +
 lib/Makefile            |    2 +-
 lib/closure.c           |  363 ++++++++++++++++++++++++++++
 4 files changed, 986 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/closure.h
 create mode 100644 lib/closure.c
diff --git a/include/linux/closure.h b/include/linux/closure.h
new file mode 100644
index 0000000..9dceb01
--- /dev/null
+++ b/include/linux/closure.h
@@ -0,0 +1,614 @@
+#ifndef _LINUX_CLOSURE_H
+#define _LINUX_CLOSURE_H
+
+#include <linux/llist.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+
+/*
+ * Closure is perhaps the most overused and abused term in computer science, but
+ * since I've been unable to come up with anything better you're stuck with it
+ * again.
+ *
+ * What are closures?
+ *
+ * They embed a refcount. The basic idea is they count "things that are in
+ * progress" - in flight bios, some other thread that's doing something else -
+ * anything you might want to wait on.
+ *
+ * The refcount may be manipulated with closure_get() and closure_put().
+ * closure_put() is where many of the interesting things happen, when it causes
+ * the refcount to go to 0.
+ *
+ * Closures can be used to wait on things both synchronously and asynchronously,
+ * and synchronous and asynchronous use can be mixed without restriction. To
+ * wait synchronously, use closure_sync() - you will sleep until your closure's
+ * refcount hits 1.
+ *
+ * To wait asynchronously, use
+ *   continue_at(cl, next_function, workqueue);
+ *
+ * passing it, as you might expect, the function to run when nothing is pending
+ * and the workqueue to run that function out of.
+ *
+ * continue_at() also, critically, is a macro that returns the calling function.
+ * There's good reason for this.
+ *
+ * To use safely closures asynchronously, they must always have a refcount while
+ * they are running owned by the thread that is running them. Otherwise, suppose
+ * you submit some bios and wish to have a function run when they all complete:
+ *
+ * foo_endio(struct bio *bio, int error)
+ * {
+ *	closure_put(cl);
+ * }
+ *
+ * closure_init(cl);
+ *
+ * do_stuff();
+ * closure_get(cl);
+ * bio1->bi_endio = foo_endio;
+ * bio_submit(bio1);
+ *
+ * do_more_stuff();
+ * closure_get(cl);
+ * bio2->bi_endio = foo_endio;
+ * bio_submit(bio2);
+ *
+ * continue_at(cl, complete_some_read, system_wq);
+ *
+ * If closure's refcount started at 0, complete_some_read() could run before the
+ * second bio was submitted - which is almost always not what you want! More
+ * importantly, it wouldn't be possible to say whether the original thread or
+ * complete_some_read()'s thread owned the closure - and whatever state it was
+ * associated with!
+ *
+ * So, closure_init() initializes a closure's refcount to 1 - and when a
+ * closure_fn is run, the refcount will be reset to 1 first.
+ *
+ * Then, the rule is - if you got the refcount with closure_get(), release it
+ * with closure_put() (i.e, in a bio->bi_endio function). If you have a refcount
+ * on a closure because you called closure_init() or you were run out of a
+ * closure - _always_ use continue_at(). Doing so consistently will help
+ * eliminate an entire class of particularly pernicious races.
+ *
+ * For a closure to wait on an arbitrary event, we need to introduce waitlists:
+ *
+ * closure_list_t list;
+ * closure_wait_event(list, cl, condition);
+ * closure_wake_up(wait_list);
+ *
+ * These work analagously to wait_event() and wake_up() - except that instead of
+ * operating on the current thread (for wait_event()) and lists of threads, they
+ * operate on an explicit closure and lists of closures.
+ *
+ * Because it's a closure we can now wait either synchronously or
+ * asynchronously. closure_wait_event() returns the current value of the
+ * condition, and if it returned false continue_at() or closure_sync() can be
+ * used to wait for it to become true.
+ *
+ * It's useful for waiting on things when you can't sleep in the context in
+ * which you must check the condition (perhaps a spinlock held, or you might be
+ * beneath generic_make_request() - in which case you can't sleep on IO).
+ *
+ * closure_wait_event() will wait either synchronously or asynchronously,
+ * depending on whether the closure is in blocking mode or not. You can pick a
+ * mode explicitly with closure_wait_event_sync() and
+ * closure_wait_event_async(), which do just what you might expect.
+ *
+ * Lastly, you might have a wait list dedicated to a specific event, and have no
+ * need for specifying the condition - you just want to wait until someone runs
+ * closure_wake_up() on the appropriate wait list. In that case, just use
+ * closure_wait(). It will return either true or false, depending on whether the
+ * closure was already on a wait list or not - a closure can only be on one wait
+ * list at a time.
+ *
+ * Parents:
+ *
+ * closure_init() takes two arguments - it takes the closure to initialize, and
+ * a (possibly null) parent.
+ *
+ * If parent is non null, the new closure will have a refcount for its lifetime;
+ * a closure is considered to be "finished" when its refcount hits 0 and the
+ * function to run is null. Hence
+ *
+ * continue_at(cl, NULL, NULL);
+ *
+ * returns up the (spaghetti) stack of closures, precisely like normal return
+ * returns up the C stack. continue_at() with non null fn is better thought of
+ * as doing a tail call.
+ *
+ * All this implies that a closure should typically be embedded in a particular
+ * struct (which its refcount will normally control the lifetime of), and that
+ * struct can very much be thought of as a stack frame.
+ *
+ * Locking:
+ *
+ * Closures are based on work items but they can be thought of as more like
+ * threads - in that like threads and unlike work items they have a well
+ * defined lifetime; they are created (with closure_init()) and eventually
+ * complete after a continue_at(cl, NULL, NULL).
+ *
+ * Suppose you've got some larger structure with a closure embedded in it that's
+ * used for periodically doing garbage collection. You only want one garbage
+ * collection happening at a time, so the natural thing to do is protect it with
+ * a lock. However, it's difficult to use a lock protecting a closure correctly
+ * because the unlock should come after the last continue_to() (additionally, if
+ * you're using the closure asynchronously a mutex won't work since a mutex has
+ * to be unlocked by the same process that locked it).
+ *
+ * So to make it less error prone and more efficient, we also have the ability
+ * to use closures as locks:
+ *
+ * closure_init_unlocked();
+ * closure_trylock();
+ *
+ * That's all we need for trylock() - the last closure_put() implicitly unlocks
+ * it for you.  But for closure_lock(), we also need a wait list:
+ *
+ * struct closure_with_waitlist frobnicator_cl;
+ *
+ * closure_init_unlocked(&frobnicator_cl);
+ * closure_lock(&frobnicator_cl);
+ *
+ * A closure_with_waitlist embeds a closure and a wait list - much like struct
+ * delayed_work embeds a work item and a timer_list. The important thing is, use
+ * it exactly like you would a regular closure and closure_put() will magically
+ * handle everything for you.
+ *
+ * We've got closures that embed timers, too. They're called, appropriately
+ * enough:
+ * struct closure_with_timer;
+ *
+ * This gives you access to closure_sleep(). It takes a refcount for a specified
+ * number of jiffies - you could then call closure_sync() (for a slightly
+ * convoluted version of msleep()) or continue_at() - which gives you the same
+ * effect as using a delayed work item, except you can reuse the work_struct
+ * already embedded in struct closure.
+ *
+ * Lastly, there's struct closure_with_waitlist_and_timer. It does what you
+ * probably expect, if you happen to need the features of both. (You don't
+ * really want to know how all this is implemented, but if I've done my job
+ * right you shouldn't have to care).
+ */
+
+struct closure;
+typedef void (closure_fn) (struct closure *);
+
+typedef struct llist_head	closure_list_t;
+
+struct closure {
+	union {
+		struct {
+			struct workqueue_struct *wq;
+			struct task_struct	*task;
+			struct llist_node	list;
+			closure_fn		*fn;
+		};
+		struct work_struct	work;
+	};
+
+	struct closure		*parent;
+
+#define CLOSURE_REMAINING_MASK	(~(~0 << 20))
+#define CLOSURE_GUARD_MASK					\
+	((1 << 20)|(1 << 22)|(1 << 24)|(1 << 26)|(1 << 28)|(1 << 30))
+
+	/*
+	 * CLOSURE_RUNNING: Set when a closure is running (i.e. by
+	 * closure_init() and when closure_put() runs then next function), and
+	 * must be cleared before remaining hits 0. Primarily to help guard
+	 * against incorrect usage and accidently transferring references.
+	 * continue_at() and closure_return() clear it for you, if you're doing
+	 * something unusual you can use closure_set_dead() which also helps
+	 * annotate where references are being transferred.
+	 *
+	 * CLOSURE_BLOCKING: Causes closure_wait_event() to block, instead of
+	 * waiting asynchronously
+	 *
+	 * CLOSURE_STACK: Sanity check - remaining should never hit 0 on a
+	 * closure with this flag set
+	 *
+	 * CLOSURE_WAITING: Set iff the closure is on a waitlist. Must be set by
+	 * the thread that owns the closure, and cleared by the thread that's
+	 * waking up the closure.
+	 *
+	 * CLOSURE_SLEEPING: Must be set before a thread uses a closure to sleep
+	 * - indicates that cl->task is valid and closure_put() may wake it up.
+	 * Only set or cleared by the thread that owns the closure.
+	 *
+	 * CLOSURE_TIMER: Analagous to CLOSURE_WAITING, indicates that a closure
+	 * has an outstanding timer. Must be set by the thread that owns the
+	 * closure, and cleared by the timer function when the timer goes off.
+	 */
+
+#define	CLOSURE_RUNNING		(1 << 21)
+#define	CLOSURE_BLOCKING	(1 << 23)
+#define CLOSURE_STACK		(1 << 25)
+#define	CLOSURE_WAITING		(1 << 27)
+#define	CLOSURE_SLEEPING	(1 << 29)
+#define	CLOSURE_TIMER		(1 << 31)
+	atomic_t		remaining;
+
+#define CLOSURE_REMAINING_INITIALIZER (1|CLOSURE_RUNNING)
+
+#define TYPE_closure			0U
+#define TYPE_closure_with_waitlist	1U
+#define TYPE_closure_with_timer		2U
+#define TYPE_closure_with_waitlist_and_timer 3U
+#define MAX_CLOSURE_TYPE		3U
+	unsigned		type;
+
+#ifdef CONFIG_DEBUG_CLOSURES
+#define CLOSURE_MAGIC_DEAD	0xc054dead
+#define CLOSURE_MAGIC_ALIVE	0xc054a11e
+
+	unsigned		magic;
+	struct list_head	all;
+	unsigned long		ip;
+	unsigned long		waiting_on;
+#endif
+};
+
+struct closure_with_waitlist {
+	struct closure	cl;
+	closure_list_t	wait;
+};
+
+struct closure_with_timer {
+	struct closure	cl;
+	struct timer_list timer;
+};
+
+struct closure_with_waitlist_and_timer {
+	struct closure	cl;
+	closure_list_t	wait;
+	struct timer_list timer;
+};
+
+extern unsigned invalid_closure_type(void);
+
+#define __CL_TYPE(cl, _t)						\
+	  __builtin_types_compatible_p(typeof(cl), struct _t)		\
+		? TYPE_ ## _t :						\
+
+#define __closure_type(cl)						\
+(									\
+	__CL_TYPE(cl, closure)						\
+	__CL_TYPE(cl, closure_with_waitlist)				\
+	__CL_TYPE(cl, closure_with_timer)				\
+	__CL_TYPE(cl, closure_with_waitlist_and_timer)			\
+	invalid_closure_type()						\
+)
+
+void closure_sub(struct closure *cl, int v);
+void closure_put(struct closure *cl);
+void closure_queue(struct closure *cl);
+void __closure_wake_up(closure_list_t *list);
+bool closure_wait(closure_list_t *list, struct closure *cl);
+void closure_sync(struct closure *cl);
+
+bool closure_trylock(struct closure *cl, struct closure *parent);
+void __closure_lock(struct closure *cl, struct closure *parent,
+		    closure_list_t *wait_list);
+
+void do_closure_timer_init(struct closure *cl);
+bool __closure_sleep(struct closure *cl, unsigned long delay,
+		     struct timer_list *timer);
+void __closure_flush(struct closure *cl, struct timer_list *timer);
+void __closure_flush_sync(struct closure *cl, struct timer_list *timer);
+
+#ifdef CONFIG_DEBUG_CLOSURES
+
+void closure_debug_create(struct closure *cl);
+void closure_debug_destroy(struct closure *cl);
+
+#else
+
+static inline void closure_debug_create(struct closure *cl) {}
+static inline void closure_debug_destroy(struct closure *cl) {}
+
+#endif
+
+static inline void closure_set_ip(struct closure *cl)
+{
+#ifdef CONFIG_DEBUG_CLOSURES
+	cl->ip = _THIS_IP_;
+#endif
+}
+
+static inline void closure_set_ret_ip(struct closure *cl)
+{
+#ifdef CONFIG_DEBUG_CLOSURES
+	cl->ip = _RET_IP_;
+#endif
+}
+
+static inline void closure_get(struct closure *cl)
+{
+#ifdef CONFIG_DEBUG_CLOSURES
+	BUG_ON((atomic_inc_return(&cl->remaining) &
+		CLOSURE_REMAINING_MASK) <= 1);
+#else
+	atomic_inc(&cl->remaining);
+#endif
+}
+
+static inline void closure_set_stopped(struct closure *cl)
+{
+	atomic_sub(CLOSURE_RUNNING, &cl->remaining);
+}
+
+static inline bool closure_is_stopped(struct closure *cl)
+{
+	return !(atomic_read(&cl->remaining) & CLOSURE_RUNNING);
+}
+
+static inline bool closure_is_unlocked(struct closure *cl)
+{
+	return atomic_read(&cl->remaining) == -1;
+}
+
+static inline void do_closure_init(struct closure *cl, struct closure *parent,
+				   bool running)
+{
+	switch (cl->type) {
+	case TYPE_closure_with_timer:
+	case TYPE_closure_with_waitlist_and_timer:
+		do_closure_timer_init(cl);
+	}
+
+#if defined(CONFIG_LOCKDEP) || defined(CONFIG_DEBUG_OBJECTS_WORK)
+	INIT_WORK(&cl->work, NULL);
+#endif
+	cl->parent = parent;
+	if (parent)
+		closure_get(parent);
+
+	if (running) {
+		closure_debug_create(cl);
+		atomic_set(&cl->remaining, CLOSURE_REMAINING_INITIALIZER);
+	} else
+		atomic_set(&cl->remaining, -1);
+}
+
+/*
+ * Hack to get at the embedded closure if there is one, by doing an unsafe cast:
+ * the result of __closure_type() is thrown away, it's used merely for type
+ * checking.
+ */
+#define __to_internal_closure(cl)				\
+({								\
+	BUILD_BUG_ON(__closure_type(*cl) > MAX_CLOSURE_TYPE);	\
+	(struct closure *) cl;					\
+})
+
+#define closure_init_type(cl, parent, running, memset)		\
+do {								\
+	struct closure *_cl = __to_internal_closure(cl);	\
+	_cl->type = __closure_type(*(cl));			\
+	closure_set_ip(_cl);					\
+	do_closure_init(_cl, parent, running);			\
+} while (0)
+
+/**
+ * __closure_init() - Initialize a closure, skipping the memset()
+ *
+ * May be used instead of closure_init() when memory has already been zeroed.
+ */
+#define __closure_init(cl, parent)				\
+	closure_init_type(cl, parent, true, false)
+
+/**
+ * closure_init() - Initialize a closure, setting the refcount to 1
+ * @cl:		closure to initialize
+ * @parent:	parent of the new closure. cl will take a refcount on it for its
+ *		lifetime; may be NULL.
+ */
+#define closure_init(cl, parent)				\
+	closure_init_type(cl, parent, true, true)
+
+static inline void closure_init_stack(struct closure *cl)
+{
+	memset(cl, 0, sizeof(struct closure));
+	atomic_set(&cl->remaining, CLOSURE_REMAINING_INITIALIZER|
+		   CLOSURE_BLOCKING|CLOSURE_STACK);
+}
+
+/**
+ * closure_init_unlocked() - Initialize a closure but leave it unlocked.
+ * @cl:		closure to initialize
+ *
+ * For when the closure will be used as a lock. The closure may not be used
+ * until after a closure_lock() or closure_trylock().
+ */
+#define closure_init_unlocked(cl)				\
+	closure_init_type(cl, NULL, false, true)
+
+/**
+ * closure_lock() - lock and initialize a closure.
+ * @cl:		the closure to lock
+ * @parent:	the new parent for this closure
+ *
+ * The closure must be of one of the types that has a waitlist (otherwise we
+ * wouldn't be able to sleep on contention).
+ *
+ * @parent has exactly the same meaning as in closure_init(); if non null, the
+ * closure will take a reference on @parent which will be released when it is
+ * unlocked.
+ */
+#define closure_lock(cl, parent)				\
+	__closure_lock(__to_internal_closure(cl), parent, &(cl)->wait)
+
+/**
+ * closure_sleep() - asynchronous sleep
+ * @cl:		the closure that will sleep
+ * @delay:	the delay in jiffies
+ *
+ * Takes a refcount on @cl which will be released after @delay jiffies; this may
+ * be used to have a function run after a delay with continue_at(), or
+ * closure_sync() may be used for a convoluted version of msleep().
+ */
+#define closure_sleep(cl, delay)				\
+	__closure_sleep(__to_internal_closure(cl), delay, &(cl)->timer)
+
+#define closure_flush(cl)				\
+	__closure_flush(__to_internal_closure(cl), &(cl)->timer)
+
+#define closure_flush_sync(cl)				\
+	__closure_flush_sync(__to_internal_closure(cl), &(cl)->timer)
+
+static inline void __closure_end_sleep(struct closure *cl)
+{
+	__set_current_state(TASK_RUNNING);
+
+	if (atomic_read(&cl->remaining) & CLOSURE_SLEEPING)
+		atomic_sub(CLOSURE_SLEEPING, &cl->remaining);
+}
+
+static inline void __closure_start_sleep(struct closure *cl)
+{
+	closure_set_ip(cl);
+	cl->task = current;
+	set_current_state(TASK_UNINTERRUPTIBLE);
+
+	if (!(atomic_read(&cl->remaining) & CLOSURE_SLEEPING))
+		atomic_add(CLOSURE_SLEEPING, &cl->remaining);
+}
+
+/**
+ * closure_blocking() - returns true if the closure is in blocking mode.
+ *
+ * If a closure is in blocking mode, closure_wait_event() will sleep until the
+ * condition is true instead of waiting asynchronously.
+ */
+static inline bool closure_blocking(struct closure *cl)
+{
+	return atomic_read(&cl->remaining) & CLOSURE_BLOCKING;
+}
+
+/**
+ * set_closure_blocking() - put a closure in blocking mode.
+ *
+ * If a closure is in blocking mode, closure_wait_event() will sleep until the
+ * condition is true instead of waiting asynchronously.
+ *
+ * Not thread safe - can only be called by the thread running the closure.
+ */
+static inline void set_closure_blocking(struct closure *cl)
+{
+	if (!closure_blocking(cl))
+		atomic_add(CLOSURE_BLOCKING, &cl->remaining);
+}
+
+/*
+ * Not thread safe - can only be called by the thread running the closure.
+ */
+static inline void clear_closure_blocking(struct closure *cl)
+{
+	if (closure_blocking(cl))
+		atomic_sub(CLOSURE_BLOCKING, &cl->remaining);
+}
+
+/**
+ * closure_wake_up() - wake up all closures on a wait list.
+ */
+static inline void closure_wake_up(closure_list_t *list)
+{
+	smp_mb();
+	__closure_wake_up(list);
+}
+
+/*
+ * Wait on an event, synchronously or asynchronously - analagous to wait_event()
+ * but for closures.
+ *
+ * The loop is oddly structured so as to avoid a race; we must check the
+ * condition again after we've added ourself to the waitlist. We know if we were
+ * already on the waitlist because closure_wait() returns false; thus, we only
+ * schedule or break if closure_wait() returns false. If it returns true, we
+ * just loop again - rechecking the condition.
+ *
+ * The __closure_wake_up() is necessary because we may race with the event
+ * becoming true; i.e. we see event false -> wait -> recheck condition, but the
+ * thread that made the event true may have called closure_wake_up() before we
+ * added ourself to the wait list.
+ *
+ * We have to call closure_sync() at the end instead of just
+ * __closure_end_sleep() because a different thread might've called
+ * closure_wake_up() before us and gotten preempted before they dropped the
+ * refcount on our closure. If this was a stack allocated closure, that would be
+ * bad.
+ */
+#define __closure_wait_event(list, cl, condition, _block)		\
+({									\
+	__label__ out;							\
+	bool block = _block;						\
+	typeof(condition) ret;						\
+									\
+	while (!(ret = (condition))) {					\
+		if (block)						\
+			__closure_start_sleep(cl);			\
+		if (!closure_wait(list, cl)) {				\
+			if (!block)					\
+				goto out;				\
+			schedule();					\
+		}							\
+	}								\
+	__closure_wake_up(list);					\
+	if (block)							\
+		closure_sync(cl);					\
+out:									\
+	ret;								\
+})
+
+/**
+ * closure_wait_event() - wait on a condition, synchronously or asynchronously.
+ * @list:	the wait list to wait on
+ * @cl:		the closure that is doing the waiting
+ * @condition:	a C expression for the event to wait for
+ *
+ * If the closure is in blocking mode, sleeps until the @condition evaluates to
+ * true - exactly like wait_event().
+ *
+ * If the closure is not in blocking mode, waits asynchronously; if the
+ * condition is currently false the @cl is put onto @list and returns. @list
+ * owns a refcount on @cl; closure_sync() or continue_at() may be used later to
+ * wait for another thread to wake up @list, which drops the refcount on @cl.
+ *
+ * Returns the value of @condition; @cl will be on @list iff @condition was
+ * false.
+ *
+ * closure_wake_up(@list) must be called after changing any variable that could
+ * cause @condition to become true.
+ */
+#define closure_wait_event(list, cl, condition)				\
+	__closure_wait_event(list, cl, condition, closure_blocking(cl))
+
+#define closure_wait_event_async(list, cl, condition)			\
+	__closure_wait_event(list, cl, condition, false)
+
+#define closure_wait_event_sync(list, cl, condition)			\
+	__closure_wait_event(list, cl, condition, true)
+
+static inline void set_closure_fn(struct closure *cl, closure_fn *fn,
+				  struct workqueue_struct *wq)
+{
+	cl->fn = fn;
+	cl->wq = wq;
+	/* between atomic_dec() in closure_put() */
+	smp_mb__before_atomic_dec();
+}
+
+#define continue_at(_cl, _fn, _wq, ...)					\
+do {									\
+	BUG_ON(!(_cl) || object_is_on_stack(_cl));			\
+	closure_set_ip(_cl);						\
+	set_closure_fn(_cl, _fn, _wq);					\
+	closure_sub(_cl, CLOSURE_RUNNING + 1);				\
+	return __VA_ARGS__;						\
+} while (0)
+
+#define closure_return(_cl)	continue_at((_cl), NULL, NULL)
+
+#endif /* _LINUX_CLOSURE_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 6777153..2a486e0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -378,6 +378,14 @@ config DEBUG_OBJECTS_ENABLE_DEFAULT
         help
           Debug objects boot parameter default value
 
+config DEBUG_CLOSURES
+	bool "Debug closures"
+	select DEBUG_FS
+	---help---
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config DEBUG_SLAB
 	bool "Debug slab memory allocations"
 	depends on DEBUG_KERNEL && SLAB && !KMEMCHECK
diff --git a/lib/Makefile b/lib/Makefile
index 18515f0..35ad204 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,7 @@ lib-y	+= kobject.o klist.o
 obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
 	 string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o \
-	 bsearch.o find_last_bit.o find_next_bit.o llist.o
+	 bsearch.o find_last_bit.o find_next_bit.o llist.o closure.o
 obj-y += kstrtox.o
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 
diff --git a/lib/closure.c b/lib/closure.c
new file mode 100644
index 0000000..5e9fd98
--- /dev/null
+++ b/lib/closure.c
@@ -0,0 +1,363 @@
+
+#include <linux/closure.h>
+#include <linux/debugfs.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+
+/*
+ * Closure like things
+ * See include/linux/closure.h for full documentation
+ */
+
+void closure_queue(struct closure *cl)
+{
+	struct workqueue_struct *wq = cl->wq;
+	if (wq) {
+		cl->work.data = (atomic_long_t) WORK_DATA_INIT();
+		INIT_LIST_HEAD(&cl->work.entry);
+		BUG_ON(!queue_work(wq, &cl->work));
+	} else
+		cl->fn(cl);
+}
+EXPORT_SYMBOL_GPL(closure_queue);
+
+static void closure_wake_up_after_xchg(struct llist_node *);
+
+#define CL_FIELD(type, field)					\
+	case TYPE_ ## type:					\
+	return &container_of(cl, struct type, cl)->field
+
+static closure_list_t *closure_waitlist(struct closure *cl)
+{
+	switch (cl->type) {
+		CL_FIELD(closure_with_waitlist, wait);
+		CL_FIELD(closure_with_waitlist_and_timer, wait);
+	}
+	return NULL;
+}
+
+static struct timer_list *closure_timer(struct closure *cl)
+{
+	switch (cl->type) {
+		CL_FIELD(closure_with_timer, timer);
+		CL_FIELD(closure_with_waitlist_and_timer, timer);
+	}
+	return NULL;
+}
+
+static void closure_put_after_sub(struct closure *cl, int r)
+{
+	BUG_ON(r & CLOSURE_GUARD_MASK);
+	/* CLOSURE_BLOCK is the only flag that's allowed when r hits 0 */
+	BUG_ON((r & CLOSURE_REMAINING_MASK) == 0 &&
+	       (r & ~CLOSURE_BLOCKING));
+
+	/* Must deliver precisely one wakeup */
+	if ((r & CLOSURE_REMAINING_MASK) == 1 &&
+	    (r & CLOSURE_SLEEPING)) {
+		smp_mb__after_atomic_dec();
+		wake_up_process(cl->task);
+	}
+
+	if ((r & CLOSURE_REMAINING_MASK) == 0) {
+		smp_mb__after_atomic_dec();
+
+		if (cl->fn) {
+			/* CLOSURE_BLOCKING might be set - clear it */
+			atomic_set(&cl->remaining,
+				   CLOSURE_REMAINING_INITIALIZER);
+			closure_queue(cl);
+		} else {
+			struct closure *parent = cl->parent;
+			closure_list_t *wait = closure_waitlist(cl);
+
+			closure_debug_destroy(cl);
+
+			smp_wmb();
+			/* mb between last use of closure and unlocking it */
+			atomic_set(&cl->remaining, -1);
+
+			if (wait)
+				closure_wake_up(wait);
+
+			if (parent)
+				closure_put(parent);
+		}
+	}
+}
+
+/* For clearing flags with the same atomic op as a put */
+void closure_sub(struct closure *cl, int v)
+{
+	closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
+}
+EXPORT_SYMBOL_GPL(closure_sub);
+
+void closure_put(struct closure *cl)
+{
+	closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
+}
+EXPORT_SYMBOL_GPL(closure_put);
+
+static void set_waiting(struct closure *cl, unsigned long f)
+{
+#ifdef CONFIG_DEBUG_CLOSURES
+	cl->waiting_on = f;
+#endif
+}
+
+/*
+ * Broken up because closure_put() has to do the xchg() and grab the wait list
+ * before unlocking the closure, but the wakeup has to come after unlocking the
+ * closure.
+ */
+static void closure_wake_up_after_xchg(struct llist_node *list)
+{
+	struct closure *cl;
+	struct llist_node *reverse = NULL;
+
+	while (list) {
+		struct llist_node *t = list;
+		list = llist_next(list);
+
+		t->next = reverse;
+		reverse = t;
+	}
+
+	while (reverse) {
+		cl = container_of(reverse, struct closure, list);
+		reverse = llist_next(reverse);
+
+		set_waiting(cl, 0);
+		closure_sub(cl, CLOSURE_WAITING + 1);
+	}
+}
+
+void __closure_wake_up(closure_list_t *list)
+{
+	closure_wake_up_after_xchg(llist_del_all(list));
+}
+EXPORT_SYMBOL_GPL(__closure_wake_up);
+
+bool closure_wait(closure_list_t *list, struct closure *cl)
+{
+	if (atomic_read(&cl->remaining) & CLOSURE_WAITING)
+		return false;
+
+	set_waiting(cl, _RET_IP_);
+	atomic_add(CLOSURE_WAITING + 1, &cl->remaining);
+	llist_add(&cl->list, list);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(closure_wait);
+
+/**
+ * closure_sync() - sleep until a closure a closure has nothing left to wait on
+ *
+ * Sleeps until the refcount hits 1 - the thread that's running the closure owns
+ * the last refcount.
+ */
+void closure_sync(struct closure *cl)
+{
+	while (1) {
+		__closure_start_sleep(cl);
+		closure_set_ret_ip(cl);
+
+		if ((atomic_read(&cl->remaining) &
+		     CLOSURE_REMAINING_MASK) == 1)
+			break;
+
+		schedule();
+	}
+
+	__closure_end_sleep(cl);
+}
+EXPORT_SYMBOL_GPL(closure_sync);
+
+/**
+ * closure_trylock() - try to acquire the closure, without waiting
+ * @cl:		closure to lock
+ *
+ * Returns true if the closure was succesfully locked.
+ */
+bool closure_trylock(struct closure *cl, struct closure *parent)
+{
+	if (atomic_cmpxchg(&cl->remaining, -1,
+			   CLOSURE_REMAINING_INITIALIZER) != -1)
+		return false;
+
+	closure_set_ret_ip(cl);
+
+	smp_mb();
+	cl->parent = parent;
+	if (parent)
+		closure_get(parent);
+
+	closure_debug_create(cl);
+	return true;
+}
+EXPORT_SYMBOL_GPL(closure_trylock);
+
+void __closure_lock(struct closure *cl, struct closure *parent,
+		    closure_list_t *wait_list)
+{
+	struct closure wait;
+	closure_init_stack(&wait);
+
+	while (1) {
+		if (closure_trylock(cl, parent))
+			return;
+
+		closure_wait_event_sync(wait_list, &wait,
+					atomic_read(&cl->remaining) == -1);
+	}
+}
+EXPORT_SYMBOL_GPL(__closure_lock);
+
+static void closure_sleep_timer_fn(unsigned long data)
+{
+	struct closure *cl = (struct closure *) data;
+	closure_sub(cl, CLOSURE_TIMER + 1);
+}
+
+void do_closure_timer_init(struct closure *cl)
+{
+	struct timer_list *timer = closure_timer(cl);
+
+	init_timer(timer);
+	timer->data	= (unsigned long) cl;
+	timer->function = closure_sleep_timer_fn;
+}
+EXPORT_SYMBOL_GPL(do_closure_timer_init);
+
+bool __closure_sleep(struct closure *cl, unsigned long delay,
+		     struct timer_list *timer)
+{
+	if (atomic_read(&cl->remaining) & CLOSURE_TIMER)
+		return false;
+
+	BUG_ON(timer_pending(timer));
+
+	timer->expires	= jiffies + delay;
+
+	atomic_add(CLOSURE_TIMER + 1, &cl->remaining);
+	add_timer(timer);
+	return true;
+}
+EXPORT_SYMBOL_GPL(__closure_sleep);
+
+void __closure_flush(struct closure *cl, struct timer_list *timer)
+{
+	if (del_timer(timer))
+		closure_sub(cl, CLOSURE_TIMER + 1);
+}
+EXPORT_SYMBOL_GPL(__closure_flush);
+
+void __closure_flush_sync(struct closure *cl, struct timer_list *timer)
+{
+	if (del_timer_sync(timer))
+		closure_sub(cl, CLOSURE_TIMER + 1);
+}
+EXPORT_SYMBOL_GPL(__closure_flush_sync);
+
+#ifdef CONFIG_DEBUG_CLOSURES
+
+static LIST_HEAD(closure_list);
+static DEFINE_SPINLOCK(closure_list_lock);
+
+void closure_debug_create(struct closure *cl)
+{
+	unsigned long flags;
+
+	BUG_ON(cl->magic == CLOSURE_MAGIC_ALIVE);
+	cl->magic = CLOSURE_MAGIC_ALIVE;
+
+	spin_lock_irqsave(&closure_list_lock, flags);
+	list_add(&cl->all, &closure_list);
+	spin_unlock_irqrestore(&closure_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(closure_debug_create);
+
+void closure_debug_destroy(struct closure *cl)
+{
+	unsigned long flags;
+
+	BUG_ON(cl->magic != CLOSURE_MAGIC_ALIVE);
+	cl->magic = CLOSURE_MAGIC_DEAD;
+
+	spin_lock_irqsave(&closure_list_lock, flags);
+	list_del(&cl->all);
+	spin_unlock_irqrestore(&closure_list_lock, flags);
+}
+EXPORT_SYMBOL_GPL(closure_debug_destroy);
+
+static struct dentry *debug;
+
+#define work_data_bits(work) ((unsigned long *)(&(work)->data))
+
+static int debug_seq_show(struct seq_file *f, void *data)
+{
+	struct closure *cl;
+	spin_lock_irq(&closure_list_lock);
+
+	list_for_each_entry(cl, &closure_list, all) {
+		int r = atomic_read(&cl->remaining);
+
+		seq_printf(f, "%p: %pF -> %pf p %p r %i ",
+			   cl, (void *) cl->ip, cl->fn, cl->parent,
+			   r & CLOSURE_REMAINING_MASK);
+
+		if (test_bit(WORK_STRUCT_PENDING, work_data_bits(&cl->work)))
+			seq_printf(f, "Q");
+
+		if (r & CLOSURE_RUNNING)
+			seq_printf(f, "R");
+
+		if (r & CLOSURE_BLOCKING)
+			seq_printf(f, "B");
+
+		if (r & CLOSURE_STACK)
+			seq_printf(f, "S");
+
+		if (r & CLOSURE_SLEEPING)
+			seq_printf(f, "Sl");
+
+		if (r & CLOSURE_TIMER)
+			seq_printf(f, "T");
+
+		if (r & CLOSURE_WAITING)
+			seq_printf(f, " W %pF\n",
+				   (void *) cl->waiting_on);
+
+		seq_printf(f, "\n");
+	}
+
+	spin_unlock_irq(&closure_list_lock);
+	return 0;
+}
+
+static int debug_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, debug_seq_show, NULL);
+}
+
+static const struct file_operations debug_ops = {
+	.owner		= THIS_MODULE,
+	.open		= debug_seq_open,
+	.read		= seq_read,
+	.release	= single_release
+};
+
+int __init closure_debug_init(void)
+{
+	debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+	return 0;
+}
+
+module_init(closure_debug_init);
+
+#endif
+
+MODULE_AUTHOR("Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>");
+MODULE_LICENSE("GPL");
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (10 preceding siblings ...)
       [not found] ` <cover.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18  2:59 ` koverstreet
  2012-05-18  8:05   ` NeilBrown
                     ` (2 more replies)
  2012-05-18  3:00 ` [PATCH 13/13] Gut bio_add_page() koverstreet
  12 siblings, 3 replies; 48+ messages in thread
From: koverstreet @ 2012-05-18  2:59 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: Kent Overstreet, tj, axboe, agk, neilb
From: Kent Overstreet <koverstreet@google.com>
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.
But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 block/blk-core.c       |  111 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/bio.c               |   41 ++++++++++++++++++
 include/linux/bio.h    |    7 +++
 include/linux/blkdev.h |    3 +
 4 files changed, 162 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 91617eb..1d7a482 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -29,6 +29,7 @@
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
 #include <linux/delay.h>
+#include <linux/closure.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -52,6 +53,12 @@ static struct kmem_cache *request_cachep;
 struct kmem_cache *blk_requestq_cachep;
 
 /*
+ * For bio_split_hook
+ */
+static struct kmem_cache *bio_split_cache;
+static struct workqueue_struct *bio_split_wq;
+
+/*
  * Controlling structure to kblockd
  */
 static struct workqueue_struct *kblockd_workqueue;
@@ -487,6 +494,14 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
+	q->bio_split_hook = mempool_create_slab_pool(4, bio_split_cache);
+	if (!q->bio_split_hook)
+		goto fail_split_hook;
+
+	q->bio_split = bioset_create(4, 0);
+	if (!q->bio_split)
+		goto fail_split;
+
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
@@ -526,6 +541,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
+fail_split:
+	bioset_free(q->bio_split);
+fail_split_hook:
+	mempool_destroy(q->bio_split_hook);
 fail_q:
 	kmem_cache_free(blk_requestq_cachep, q);
 	return NULL;
@@ -1493,6 +1512,83 @@ static inline bool should_fail_request(struct hd_struct *part,
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+struct bio_split_hook {
+	struct closure		cl;
+	struct request_queue	*q;
+	struct bio		*bio;
+	bio_end_io_t		*bi_end_io;
+	void			*bi_private;
+};
+
+static void bio_submit_split_done(struct closure *cl)
+{
+	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
+
+	s->bio->bi_end_io	= s->bi_end_io;
+	s->bio->bi_private	= s->bi_private;
+	bio_endio(s->bio, 0);
+
+	closure_debug_destroy(&s->cl);
+	mempool_free(s, s->q->bio_split_hook);
+}
+
+static void bio_submit_split_endio(struct bio *bio, int error)
+{
+	struct closure *cl = bio->bi_private;
+	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
+
+	if (error)
+		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
+
+	bio_put(bio);
+	closure_put(cl);
+}
+
+static void __bio_submit_split(struct closure *cl)
+{
+	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
+	struct bio *bio = s->bio, *n;
+
+	do {
+		n = bio_split(bio, bio_max_sectors(bio),
+			      GFP_NOIO, s->q->bio_split);
+		if (!n)
+			continue_at(cl, __bio_submit_split, bio_split_wq);
+
+		closure_get(cl);
+		generic_make_request(n);
+	} while (n != bio);
+
+	continue_at(cl, bio_submit_split_done, NULL);
+}
+
+static int bio_submit_split(struct bio *bio)
+{
+	struct bio_split_hook *s;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (!bio_has_data(bio) ||
+	    !q ||
+	    !q->bio_split_hook ||
+	    bio_sectors(bio) <= bio_max_sectors(bio))
+		return 0;
+
+	s = mempool_alloc(q->bio_split_hook, GFP_NOIO);
+
+	closure_init(&s->cl, NULL);
+	s->bio		= bio;
+	s->q		= q;
+	s->bi_end_io	= bio->bi_end_io;
+	s->bi_private	= bio->bi_private;
+
+	bio_get(bio);
+	bio->bi_end_io	= bio_submit_split_endio;
+	bio->bi_private	= &s->cl;
+
+	__bio_submit_split(&s->cl);
+	return 1;
+}
+
 /*
  * Check whether this bio extends beyond the end of the device.
  */
@@ -1646,6 +1742,14 @@ void generic_make_request(struct bio *bio)
 	 * it is non-NULL, then a make_request is active, and new requests
 	 * should be added at the tail
 	 */
+
+	/*
+	 * If the device can't accept arbitrary sized bios, check if we
+	 * need to split:
+	 */
+	if (bio_submit_split(bio))
+		return;
+
 	if (current->bio_list) {
 		bio_list_add(current->bio_list, bio);
 		return;
@@ -2892,11 +2996,18 @@ int __init blk_dev_init(void)
 	if (!kblockd_workqueue)
 		panic("Failed to create kblockd\n");
 
+	bio_split_wq = alloc_workqueue("bio_split", WQ_MEM_RECLAIM, 0);
+	if (!bio_split_wq)
+		panic("Failed to create bio_split wq\n");
+
 	request_cachep = kmem_cache_create("blkdev_requests",
 			sizeof(struct request), 0, SLAB_PANIC, NULL);
 
 	blk_requestq_cachep = kmem_cache_create("blkdev_queue",
 			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
 
+	bio_split_cache = kmem_cache_create("bio_split_hook",
+			sizeof(struct bio_split_hook), 0, SLAB_PANIC, NULL);
+
 	return 0;
 }
diff --git a/fs/bio.c b/fs/bio.c
index 781a8cf..360ac93 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -428,6 +428,47 @@ inline int bio_phys_segments(struct request_queue *q, struct bio *bio)
 }
 EXPORT_SYMBOL(bio_phys_segments);
 
+unsigned __bio_max_sectors(struct bio *bio, struct block_device *bdev,
+			   sector_t sector)
+{
+	unsigned ret = bio_sectors(bio);
+	struct request_queue *q = bdev_get_queue(bdev);
+	struct bio_vec *bv, *end = bio_iovec(bio) +
+		min_t(int, bio_segments(bio), queue_max_segments(q));
+
+	struct bvec_merge_data bvm = {
+		.bi_bdev	= bdev,
+		.bi_sector	= sector,
+		.bi_size	= 0,
+		.bi_rw		= bio->bi_rw,
+	};
+
+	if (bio_segments(bio) > queue_max_segments(q) ||
+	    q->merge_bvec_fn) {
+		ret = 0;
+
+		for (bv = bio_iovec(bio); bv < end; bv++) {
+			if (q->merge_bvec_fn &&
+			    q->merge_bvec_fn(q, &bvm, bv) < (int) bv->bv_len)
+				break;
+
+			ret		+= bv->bv_len >> 9;
+			bvm.bi_size	+= bv->bv_len;
+		}
+
+		if (ret >= (BIO_MAX_PAGES * PAGE_SIZE) >> 9)
+			return (BIO_MAX_PAGES * PAGE_SIZE) >> 9;
+	}
+
+	ret = min(ret, queue_max_sectors(q));
+
+	WARN_ON(!ret);
+	ret = max_t(int, ret, bio_iovec(bio)->bv_len >> 9);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__bio_max_sectors);
+
 /**
  * 	__bio_clone	-	clone a bio
  * 	@bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index db38175..79f8173 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -224,6 +224,13 @@ extern void bio_endio(struct bio *, int);
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
+unsigned __bio_max_sectors(struct bio *, struct block_device *, sector_t);
+
+static inline unsigned bio_max_sectors(struct bio *bio)
+{
+	return __bio_max_sectors(bio, bio->bi_bdev, bio->bi_sector);
+}
+
 extern void __bio_clone(struct bio *, struct bio *);
 extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
 extern struct bio *bio_clone(struct bio *, gfp_t);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2aa2466..464adb7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -399,6 +399,9 @@ struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
+
+	mempool_t		*bio_split_hook;
+	struct bio_set		*bio_split;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 13/13] Gut bio_add_page()
  2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
                   ` (11 preceding siblings ...)
  2012-05-18  2:59 ` [PATCH 12/13] Make generic_make_request handle arbitrarily large bios koverstreet
@ 2012-05-18  3:00 ` koverstreet
  12 siblings, 0 replies; 48+ messages in thread
From: koverstreet @ 2012-05-18  3:00 UTC (permalink / raw)
  To: linux-bcache, linux-kernel, dm-devel, linux-fsdevel
  Cc: Kent Overstreet, tj, axboe, agk, neilb
From: Kent Overstreet <koverstreet@google.com>
Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/bio.c |  133 ++++++++++++-------------------------------------------------
 1 files changed, 26 insertions(+), 107 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index 360ac93..e3276bd 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -568,12 +568,22 @@ int bio_get_nr_vecs(struct block_device *bdev)
 }
 EXPORT_SYMBOL(bio_get_nr_vecs);
 
-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
-			  *page, unsigned int len, unsigned int offset,
-			  unsigned short max_sectors)
+/**
+ *	bio_add_page	-	attempt to add page to bio
+ *	@bio: destination bio
+ *	@page: page to add
+ *	@len: vec entry length
+ *	@offset: vec entry offset
+ *
+ *	Attempt to add a page to the bio_vec maplist. This can fail for a
+ *	number of reasons, such as the bio being full or target block device
+ *	limitations. The target block device must allow bio's up to PAGE_SIZE,
+ *	so it is always possible to add a single page to an empty bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+		 unsigned int len, unsigned int offset)
 {
-	int retried_segments = 0;
-	struct bio_vec *bvec;
+	struct bio_vec *bv;
 
 	/*
 	 * cloned bio must not modify vec list
@@ -581,40 +591,17 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (unlikely(bio_flagged(bio, BIO_CLONED)))
 		return 0;
 
-	if (((bio->bi_size + len) >> 9) > max_sectors)
-		return 0;
-
 	/*
 	 * For filesystems with a blocksize smaller than the pagesize
 	 * we will often be called with the same page as last time and
 	 * a consecutive offset.  Optimize this special case.
 	 */
 	if (bio->bi_vcnt > 0) {
-		struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-		if (page == prev->bv_page &&
-		    offset == prev->bv_offset + prev->bv_len) {
-			unsigned int prev_bv_len = prev->bv_len;
-			prev->bv_len += len;
-
-			if (q->merge_bvec_fn) {
-				struct bvec_merge_data bvm = {
-					/* prev_bvec is already charged in
-					   bi_size, discharge it in order to
-					   simulate merging updated prev_bvec
-					   as new bvec. */
-					.bi_bdev = bio->bi_bdev,
-					.bi_sector = bio->bi_sector,
-					.bi_size = bio->bi_size - prev_bv_len,
-					.bi_rw = bio->bi_rw,
-				};
-
-				if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
-					prev->bv_len -= len;
-					return 0;
-				}
-			}
+		bv = bio_iovec_idx(bio, bio->bi_vcnt - 1);
 
+		if (page == bv->bv_page &&
+		    offset == bv->bv_offset + bv->bv_len) {
+			bv->bv_len += len;
 			goto done;
 		}
 	}
@@ -622,64 +609,17 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 	if (bio->bi_vcnt >= bio->bi_max_vecs)
 		return 0;
 
-	/*
-	 * we might lose a segment or two here, but rather that than
-	 * make this too complex.
-	 */
-
-	while (bio->bi_phys_segments >= queue_max_segments(q)) {
-
-		if (retried_segments)
-			return 0;
-
-		retried_segments = 1;
-		blk_recount_segments(q, bio);
-	}
-
-	/*
-	 * setup the new entry, we might clear it again later if we
-	 * cannot add the page
-	 */
-	bvec = &bio->bi_io_vec[bio->bi_vcnt];
-	bvec->bv_page = page;
-	bvec->bv_len = len;
-	bvec->bv_offset = offset;
-
-	/*
-	 * if queue has other restrictions (eg varying max sector size
-	 * depending on offset), it can specify a merge_bvec_fn in the
-	 * queue to get further control
-	 */
-	if (q->merge_bvec_fn) {
-		struct bvec_merge_data bvm = {
-			.bi_bdev = bio->bi_bdev,
-			.bi_sector = bio->bi_sector,
-			.bi_size = bio->bi_size,
-			.bi_rw = bio->bi_rw,
-		};
-
-		/*
-		 * merge_bvec_fn() returns number of bytes it can accept
-		 * at this offset
-		 */
-		if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
-			bvec->bv_page = NULL;
-			bvec->bv_len = 0;
-			bvec->bv_offset = 0;
-			return 0;
-		}
-	}
-
-	/* If we may be able to merge these biovecs, force a recount */
-	if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
-		bio->bi_flags &= ~(1 << BIO_SEG_VALID);
+	bv		= bio_iovec_idx(bio, bio->bi_vcnt);
+	bv->bv_page	= page;
+	bv->bv_len	= len;
+	bv->bv_offset	= offset;
 
 	bio->bi_vcnt++;
-	bio->bi_phys_segments++;
- done:
+done:
 	bio->bi_size += len;
 	return len;
 }
+EXPORT_SYMBOL(bio_add_page);
 
 /**
  *	bio_add_pc_page	-	attempt to add page to bio
@@ -699,31 +639,10 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
 		    unsigned int len, unsigned int offset)
 {
-	return __bio_add_page(q, bio, page, len, offset,
-			      queue_max_hw_sectors(q));
+	return bio_add_page(bio, page, len, offset);
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
-/**
- *	bio_add_page	-	attempt to add page to bio
- *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
- *
- *	Attempt to add a page to the bio_vec maplist. This can fail for a
- *	number of reasons, such as the bio being full or target block device
- *	limitations. The target block device must allow bio's up to PAGE_SIZE,
- *	so it is always possible to add a single page to an empty bio.
- */
-int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
-		 unsigned int offset)
-{
-	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	return __bio_add_page(q, bio, page, len, offset, queue_max_sectors(q));
-}
-EXPORT_SYMBOL(bio_add_page);
-
 struct bio_map_data {
 	struct bio_vec *iovecs;
 	struct sg_iovec *sgvecs;
-- 
1.7.9.rc2
^ permalink raw reply related	[flat|nested] 48+ messages in thread
* Re: [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
  2012-05-18  2:59 ` [PATCH 12/13] Make generic_make_request handle arbitrarily large bios koverstreet
@ 2012-05-18  8:05   ` NeilBrown
       [not found]     ` <20120518180550.0a6cdc34-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2012-05-18 17:52   ` Tejun Heo
       [not found]   ` <1114e7019b0055fc09a54b59b36398d5c54f5e32.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 48+ messages in thread
From: NeilBrown @ 2012-05-18  8:05 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, tj, axboe,
	agk
[-- Attachment #1: Type: text/plain, Size: 1923 bytes --]
Hi Kent,
 there is lots of good stuff in this series and I certainly hope a lot of it
 can eventually go upstream.
 However there is a part of this patch that I don't think is safe:
> +static void __bio_submit_split(struct closure *cl)
> +{
> +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> +	struct bio *bio = s->bio, *n;
> +
> +	do {
> +		n = bio_split(bio, bio_max_sectors(bio),
> +			      GFP_NOIO, s->q->bio_split);
> +		if (!n)
> +			continue_at(cl, __bio_submit_split, bio_split_wq);
> +
> +		closure_get(cl);
> +		generic_make_request(n);
> +	} while (n != bio);
> +
> +	continue_at(cl, bio_submit_split_done, NULL);
> +}
Firstly a small point:  Can bio_split ever return NULL here?  I don't
think it can, so there is no need to test.
But if it can, then calling generic_make_request(NULL) doesn't seem like a
good idea.
More significantly though::
  This is called from generic_make_request which can be called recursively and
enforces a tail-recursion semantic.
If that generic_make_request was a recursive call, then the
generic_make_request in __bio_submit_split will not start the request, but
will queue the bio for later handling.  If we then call bio_split again, we
could try to allocation from a mempool while we are holding one entry
allocated from that pool captive.  This can deadlock.
i.e. if the original bio is so large that it needs to be split into 3 pieces,
then we will try to allocate the second piece before the first piece has a
chance to be released.  If this happens in enough threads to exhaust the pool
(4 I think), it will deadlock.
I realise this sounds like a very unlikely case, but of course they happen.
One possible approach might be to count how many splits will be required,
then have an interface to mempools so you can allocate them all at once,
or block and wait.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
       [not found]     ` <20120518180550.0a6cdc34-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2012-05-18  8:14       ` Kent Overstreet
       [not found]         ` <20120518081444.GA27205-RcKxWJ4Cfj3IzGYXcIpNmNLIRw13R84JkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18  8:14 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA
On Fri, May 18, 2012 at 06:05:50PM +1000, NeilBrown wrote:
> 
> Hi Kent,
>  there is lots of good stuff in this series and I certainly hope a lot of it
>  can eventually go upstream.
> 
>  However there is a part of this patch that I don't think is safe:
> 
> 
> > +static void __bio_submit_split(struct closure *cl)
> > +{
> > +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> > +	struct bio *bio = s->bio, *n;
> > +
> > +	do {
> > +		n = bio_split(bio, bio_max_sectors(bio),
> > +			      GFP_NOIO, s->q->bio_split);
> > +		if (!n)
> > +			continue_at(cl, __bio_submit_split, bio_split_wq);
> > +
> > +		closure_get(cl);
> > +		generic_make_request(n);
> > +	} while (n != bio);
> > +
> > +	continue_at(cl, bio_submit_split_done, NULL);
> > +}
> 
> Firstly a small point:  Can bio_split ever return NULL here?  I don't
> think it can, so there is no need to test.
> But if it can, then calling generic_make_request(NULL) doesn't seem like a
> good idea.
> 
> More significantly though::
>   This is called from generic_make_request which can be called recursively and
> enforces a tail-recursion semantic.
> If that generic_make_request was a recursive call, then the
> generic_make_request in __bio_submit_split will not start the request, but
> will queue the bio for later handling.  If we then call bio_split again, we
> could try to allocation from a mempool while we are holding one entry
> allocated from that pool captive.  This can deadlock.
> 
> i.e. if the original bio is so large that it needs to be split into 3 pieces,
> then we will try to allocate the second piece before the first piece has a
> chance to be released.  If this happens in enough threads to exhaust the pool
> (4 I think), it will deadlock.
> 
> I realise this sounds like a very unlikely case, but of course they happen.
> 
> One possible approach might be to count how many splits will be required,
> then have an interface to mempools so you can allocate them all at once,
> or block and wait.
Yeah, I actually thought of that (I probably should've documented it,
though).
That's why the code is checking if bio_split() returned NULL; my verison
of bio_split() checks if current->bio_list is non NULL, and if it is it
doesn't pass __GFP_WAIT to bio_alloc_bioset().
(I was debating whether bio_split() should do that itself or leave it up
to the caller. I tend to think it's too easy to accidentally screw up -
and not notice it - so it should be enforced by generic code. Possibly
the check should be moved to bio_alloc_bioset(), though.)
If we do get a memory allocation failure, then we just punt to workqueue
- continue_at() runs __bio_submit_split from the bio_split workqueue -
where we can safely use the mempool.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 01/13] block: Generalized bio pool freeing
       [not found]   ` <ea774fea2a27c9f1028a12ce31a7ee5e5517bef4.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 15:55     ` Tejun Heo
       [not found]       ` <20120518155538.GA19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 15:55 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Thu, May 17, 2012 at 10:59:48PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> With the old code, when you allocate a bio from a bio pool you have to
> implement your own destructor that knows how to find the bio pool the
> bio was originally allocated from.
> 
> This adds a new field to struct bio (bi_pool) and changes
> bio_alloc_bioset() to use it. This makes various bio destructors
> unnecessary, so they're then deleted.
> 
> Signed-off-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Please add Cc: to maintainers of each modified subsystem.  Other than
that, nice cleanup even without the later removal of bi_destructor.
 Acked-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 02/13] dm: kill dm_rq_bio_destructor
  2012-05-18  2:59 ` [PATCH 02/13] dm: kill dm_rq_bio_destructor koverstreet
@ 2012-05-18 15:57   ` Tejun Heo
       [not found]     ` <20120518155729.GB19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 15:57 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
On Thu, May 17, 2012 at 10:59:49PM -0400, koverstreet@google.com wrote:
> From: Kent Overstreet <koverstreet@google.com>
> 
> Signed-off-by: Kent Overstreet <koverstreet@google.com>
Please explain why this is done and how it's safe.  Alasdair / dm
folks, can you please ack this?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 03/13] block: Add bio_clone_bioset()
  2012-05-18  2:59 ` [PATCH 03/13] block: Add bio_clone_bioset() koverstreet
@ 2012-05-18 16:05   ` Tejun Heo
  2012-05-18 20:31     ` Kent Overstreet
       [not found]   ` <eb6a7d3fe7ae203202bc365d7274ee531631a9ca.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:05 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
Hey, Kent.
On Thu, May 17, 2012 at 10:59:50PM -0400, koverstreet@google.com wrote:
> From: Kent Overstreet <koverstreet@google.com>
> 
> This consolidates some code, and will help in a later patch changing how
> bio cloning works.
> 
> Signed-off-by: Kent Overstreet <koverstreet@google.com>
I'd prefer a bit more explanation on what's going on generally and why
and how dm conversion is different.
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3cc2169..3e33039 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1072,26 +1072,19 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
>   * Creates a bio that consists of range of complete bvecs.
>   */
>  static struct bio *clone_bio(struct bio *bio, sector_t sector,
> -			     unsigned short idx, unsigned short bv_count,
> +			     unsigned short bv_count,
>  			     unsigned int len, struct bio_set *bs)
>  {
>  	struct bio *clone;
>  
> -	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
> -	__bio_clone(clone, bio);
> +	clone = bio_clone_bioset(bio, GFP_NOIO, bs);
>  	clone->bi_sector = sector;
> -	clone->bi_idx = idx;
> -	clone->bi_vcnt = idx + bv_count;
> +	clone->bi_vcnt = bv_count;
>  	clone->bi_size = to_bytes(len);
> -	clone->bi_flags &= ~(1 << BIO_SEG_VALID);
Maybe removal of @idx deserves a separate patch?
> -	if (bio_integrity(bio)) {
> -		bio_integrity_clone(clone, bio, GFP_NOIO, bs);
> -
> -		if (idx != bio->bi_idx || clone->bi_size < bio->bi_size)
> -			bio_integrity_trim(clone,
> -					   bio_sector_offset(bio, idx, 0), len);
> -	}
> +	if (bio_integrity(bio) &&
> +	    clone->bi_size < bio->bi_size)
Unnecessary line break.
Other than that, looks good to me.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 04/13] block: Add bio_clone_kmalloc()
       [not found]   ` <1c7c2d4b89bc3d0e907608cec37bcf0ee50f4c0e.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:09     ` Tejun Heo
       [not found]       ` <20120518160903.GD19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:09 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Thu, May 17, 2012 at 10:59:51PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> @@ -729,7 +729,7 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>  	}
>  
>  	while (old_chain && (total < len)) {
> -		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
> +		tmp = bio_clone_kmalloc(old_chain, gfpmask);
>  		if (!tmp)
>  			goto err_out;
>  
> @@ -751,13 +751,9 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>  			if (!bp)
>  				goto err_out;
>  
> -			__bio_clone(tmp, &bp->bio1);
> -
This effectively swaps the order of bio_split() and __bio_clone().  Is
that safe?  Also, please cc the maintainer.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 03/13] block: Add bio_clone_bioset()
       [not found]   ` <eb6a7d3fe7ae203202bc365d7274ee531631a9ca.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:11     ` Vivek Goyal
  2012-05-18 18:55       ` Kent Overstreet
  0 siblings, 1 reply; 48+ messages in thread
From: Vivek Goyal @ 2012-05-18 16:11 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA
On Thu, May 17, 2012 at 10:59:50PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
[..]
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index a5a524e..47605e7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -178,23 +178,7 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
>  	if (!mddev || !mddev->bio_set)
>  		return bio_clone(bio, gfp_mask);
>  
> -	b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
> -	if (!b)
> -		return NULL;
> -
> -	__bio_clone(b, bio);
> -	if (bio_integrity(bio)) {
> -		int ret;
> -
> -		ret = bio_integrity_clone(b, bio, gfp_mask, mddev->bio_set);
> -
> -		if (ret < 0) {
> -			bio_put(b);
> -			return NULL;
> -		}
> -	}
> -
> -	return b;
> +	return bio_clone_bioset(gfp_mask, bio_segments(bio), mddev->bio_set);
Are we passing wrong arguments to bio_clone_bioset()?
Thanks
Vivek
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 05/13] block: Only clone bio vecs that are in use
  2012-05-18  2:59 ` [PATCH 05/13] block: Only clone bio vecs that are in use koverstreet
@ 2012-05-18 16:13   ` Tejun Heo
  2012-05-18 21:14     ` Kent Overstreet
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:13 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
On Thu, May 17, 2012 at 10:59:52PM -0400, koverstreet@google.com wrote:
> diff --git a/fs/bio.c b/fs/bio.c
> index e2c0970..de0733e 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -435,8 +435,9 @@ EXPORT_SYMBOL(bio_phys_segments);
>   */
>  void __bio_clone(struct bio *bio, struct bio *bio_src)
>  {
> -	memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
> -		bio_src->bi_max_vecs * sizeof(struct bio_vec));
> +	memcpy(bio->bi_io_vec,
> +	       bio_iovec(bio_src),
Unnecessary line break.
> +	       bio_segments(bio_src) * sizeof(struct bio_vec));
>  
>  	/*
>  	 * most users will be overriding ->bi_bdev with a new target,
> @@ -445,10 +446,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
>  	bio->bi_sector = bio_src->bi_sector;
>  	bio->bi_bdev = bio_src->bi_bdev;
>  	bio->bi_flags |= 1 << BIO_CLONED;
> +	bio->bi_flags &= ~(1 << BIO_SEG_VALID);
Can probably be conditionalized on bi_idx?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 01/13] block: Generalized bio pool freeing
       [not found]       ` <20120518155538.GA19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:14         ` Alasdair G Kergon
  0 siblings, 0 replies; 48+ messages in thread
From: Alasdair G Kergon @ 2012-05-18 16:14 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: Tejun Heo, axboe-tSWWG44O7X1aa/9Udqfwiw,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	Junichi Nomura
This dance should come out of clone_endio() and __map_bio() too?
        bio->bi_private = md->bs;
Alasdair
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 06/13] block: Add bio_reset()
  2012-05-18  2:59 ` [PATCH 06/13] block: Add bio_reset() koverstreet
@ 2012-05-18 16:16   ` Tejun Heo
       [not found]     ` <20120518161608.GF19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:16 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
Hello,
On Thu, May 17, 2012 at 10:59:53PM -0400, koverstreet@google.com wrote:
> From: Kent Overstreet <koverstreet@google.com>
> 
> Reusing bios is something that's been highly frowned upon in the past,
> but driver code keeps doing it anyways. If it's going to happen anyways,
> we should provide a generic method.
> 
> This'll help with getting rid of bi_destructor.
It would be nice to explain a bit why this helps bi_destructor
removal.
> +#define BIO_RESET_OFFSET	offsetof(struct bio, bi_max_vecs)
Maybe better named BIO_RESET_BYTES?  As offset tends to indicate the
start of a region.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 07/13] pktcdvd: Switch to bio_kmalloc()
       [not found]   ` <04a4d6c2c8b6f0097e3594c6e0932093afffc1da.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:18     ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:18 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Thu, May 17, 2012 at 10:59:54PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> This is prep work for killing bi_destructor
Again, it would be great if the commit message described what
conversion is taking place how and how it's been tested.  Please also
cc the maintainer.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 08/13] block: Kill bi_destructor
       [not found]   ` <de05855cf0fa4800cfab7e8340f106dccc7a75a1.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:21     ` Tejun Heo
       [not found]       ` <20120518162142.GH19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:21 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
Hello,
> @@ -417,8 +409,11 @@ void bio_put(struct bio *bio)
>  
>  		if (bio->bi_pool)
>  			bio_free(bio, bio->bi_pool);
> -		else
> -			bio->bi_destructor(bio);
> +		else {
> +			if (bio_integrity(bio))
> +				bio_integrity_free(bio, fs_bio_set);
> +			kfree(bio);
if {
} else {
}
And wouldn't it be better to make bio_free() handle kfreeing too?
Overall, I really like this change.  I hate how ->bi_destructor() has
been used.
Thanks!
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec
       [not found]   ` <363875943e9d0e13bee6ed28239280543e6e5055.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:30     ` Tejun Heo
  2012-05-18 21:49       ` Kent Overstreet
  2012-05-18 17:07     ` Boaz Harrosh
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 16:30 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Thu, May 17, 2012 at 10:59:56PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> This is for the new bio splitting code. When we split a bio, if the
> split occured on a bvec boundry we reuse the bvec for the new bio. But
> that means bio_free() can't free it, hence the explicit flag.
> 
> Signed-off-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/bio.c                  |    3 ++-
>  include/linux/blk_types.h |    1 +
>  2 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index ecc9088..3332800 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -234,7 +234,7 @@ void bio_free(struct bio *bio, struct bio_set *bs)
>  {
>  	void *p;
>  
> -	if (bio_has_allocated_vec(bio))
> +	if (bio_flagged(bio, BIO_HAS_VEC))
>  		bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));
We can remove bio_has_allocated_vec()?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 02/13] dm: kill dm_rq_bio_destructor
       [not found]     ` <20120518155729.GB19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:43       ` Alasdair G Kergon
       [not found]         ` <20120518164319.GJ29330-FDJ95KluN3Z0klwcnFlA1dvLeJWuRmrY@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Alasdair G Kergon @ 2012-05-18 16:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: koverstreet-hpIqsD4AKlfQT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	Junichi Nomura, Mikulas Patocka
On Fri, May 18, 2012 at 08:57:29AM -0700, Tejun Heo wrote:
> Please explain why this is done and how it's safe.  Alasdair / dm
> folks, can you please ack this?
 
I think it's relying on there being never more than one reference on those 
bios so that that endio fn, called exactly once, always frees it and there
are no dm_puts elsewhere.
Alasdair
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 04/13] block: Add bio_clone_kmalloc()
  2012-05-18  2:59 ` [PATCH 04/13] block: Add bio_clone_kmalloc() koverstreet
       [not found]   ` <1c7c2d4b89bc3d0e907608cec37bcf0ee50f4c0e.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 16:45   ` Boaz Harrosh
  1 sibling, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2012-05-18 16:45 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, tj, axboe,
	agk, neilb
On 05/18/2012 05:59 AM, koverstreet@google.com wrote:
> From: Kent Overstreet <koverstreet@google.com>
> 
For the osdblk.c and exofs/ore.c bits. Thanks
looks nice.
ACK-by: Boaz Harrosh <bharrosh@panasas.com>
> Signed-off-by: Kent Overstreet <koverstreet@google.com>
> ---
>  drivers/block/osdblk.c |    3 +--
>  drivers/block/rbd.c    |    8 ++------
>  fs/bio.c               |   13 +++++++++++++
>  fs/exofs/ore.c         |    5 ++---
>  include/linux/bio.h    |    1 +
>  5 files changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index 87311eb..1bbc681 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -266,11 +266,10 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
>  	struct bio *tmp, *new_chain = NULL, *tail = NULL;
>  
>  	while (old_chain) {
> -		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
> +		tmp = bio_clone_kmalloc(old_chain, gfpmask);
>  		if (!tmp)
>  			goto err_out;
>  
> -		__bio_clone(tmp, old_chain);
>  		tmp->bi_bdev = NULL;
>  		gfpmask &= ~__GFP_WAIT;
>  		tmp->bi_next = NULL;
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 013c7a5..5a953c6 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -729,7 +729,7 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>  	}
>  
>  	while (old_chain && (total < len)) {
> -		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
> +		tmp = bio_clone_kmalloc(old_chain, gfpmask);
>  		if (!tmp)
>  			goto err_out;
>  
> @@ -751,13 +751,9 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>  			if (!bp)
>  				goto err_out;
>  
> -			__bio_clone(tmp, &bp->bio1);
> -
>  			*next = &bp->bio2;
> -		} else {
> -			__bio_clone(tmp, old_chain);
> +		} else
>  			*next = old_chain->bi_next;
> -		}
>  
>  		tmp->bi_bdev = NULL;
>  		gfpmask &= ~__GFP_WAIT;
> diff --git a/fs/bio.c b/fs/bio.c
> index 4cb621f..e2c0970 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -491,6 +491,19 @@ struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(bio_clone);
>  
> +struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
> +{
> +	struct bio *b = bio_kmalloc(gfp_mask, bio->bi_max_vecs);
> +
> +	if (!b)
> +		return NULL;
> +
> +	__bio_clone(b, bio);
> +
> +	return b;
> +}
> +EXPORT_SYMBOL(bio_clone_kmalloc);
> +
>  /**
>   *	bio_get_nr_vecs		- return approx number of vecs
>   *	@bdev:  I/O target
> diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
> index 49cf230..95c7264 100644
> --- a/fs/exofs/ore.c
> +++ b/fs/exofs/ore.c
> @@ -820,8 +820,8 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
>  			struct bio *bio;
>  
>  			if (per_dev != master_dev) {
> -				bio = bio_kmalloc(GFP_KERNEL,
> -						  master_dev->bio->bi_max_vecs);
> +				bio = bio_clone_kmalloc(master_dev->bio,
> +							GFP_KERNEL);
>  				if (unlikely(!bio)) {
>  					ORE_DBGMSG(
>  					      "Failed to allocate BIO size=%u\n",
> @@ -830,7 +830,6 @@ static int _write_mirror(struct ore_io_state *ios, int cur_comp)
>  					goto out;
>  				}
>  
> -				__bio_clone(bio, master_dev->bio);
>  				bio->bi_bdev = NULL;
>  				bio->bi_next = NULL;
>  				per_dev->offset = master_dev->offset;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index e068a68..b27f16b 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -225,6 +225,7 @@ extern int bio_phys_segments(struct request_queue *, struct bio *);
>  extern void __bio_clone(struct bio *, struct bio *);
>  extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
>  extern struct bio *bio_clone(struct bio *, gfp_t);
> +struct bio *bio_clone_kmalloc(struct bio *, gfp_t);
>  
>  extern void bio_init(struct bio *);
>  
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 10/13] block: Rework bio splitting
       [not found]   ` <ac4c1cbd10934fdc3a4af74d4cb5ec370f9139d5.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 17:07     ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 17:07 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
Hello,
On Thu, May 17, 2012 at 10:59:57PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> Break out bio_split() and bio_pair_split(), and get rid of the
> limitations of bio_split()
Again, what are those limitations being removed and why and at what
cost?
> diff --git a/fs/bio.c b/fs/bio.c
> index 3332800..781a8cf 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -35,7 +35,7 @@
>   */
>  #define BIO_INLINE_VECS		4
>  
> -static mempool_t *bio_split_pool __read_mostly;
> +static struct bio_set *bio_split_pool __read_mostly;
>  
>  /*
>   * if you change this list, also change bvec_alloc or things will
> @@ -1464,84 +1464,124 @@ void bio_endio(struct bio *bio, int error)
>  }
>  EXPORT_SYMBOL(bio_endio);
>  
Please add /** comment documenting the function.
> -void bio_pair_release(struct bio_pair *bp)
> +struct bio *bio_split(struct bio *bio, int sectors,
> +		      gfp_t gfp, struct bio_set *bs)
>  {
> -	if (atomic_dec_and_test(&bp->cnt)) {
> -		struct bio *master = bp->bio1.bi_private;
> +	unsigned idx, vcnt = 0, nbytes = sectors << 9;
> +	struct bio_vec *bv;
> +	struct bio *ret = NULL;
Maybe naming it @split is better?
> +
> +	BUG_ON(sectors <= 0);
> +
> +	if (current->bio_list)
> +		gfp &= ~__GFP_WAIT;
Please explain what this means.
> +	if (nbytes >= bio->bi_size)
> +		return bio;
> +
> +	trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
> +				bio->bi_sector + sectors);
> +
> +	bio_for_each_segment(bv, bio, idx) {
> +		vcnt = idx - bio->bi_idx;
> +
> +		if (!nbytes) {
> +			ret = bio_alloc_bioset(gfp, 0, bs);
> +			if (!ret)
> +				return NULL;
> +
> +			ret->bi_io_vec = bio_iovec(bio);
> +			ret->bi_flags |= 1 << BIO_CLONED;
> +			break;
> +		} else if (nbytes < bv->bv_len) {
> +			ret = bio_alloc_bioset(gfp, ++vcnt, bs);
> +			if (!ret)
> +				return NULL;
> +
> +			memcpy(ret->bi_io_vec, bio_iovec(bio),
> +			       sizeof(struct bio_vec) * vcnt);
> +
> +			ret->bi_io_vec[vcnt - 1].bv_len = nbytes;
> +			bv->bv_offset	+= nbytes;
> +			bv->bv_len	-= nbytes;
Please don't indent assignments.
> +			break;
> +		}
> +
> +		nbytes -= bv->bv_len;
> +	}
I find the code a bit confusing.  Wouldn't it be better to structure
it as
	bio_for_each_segment() {
		find splitting point;
	}
	Do all of the splitting.
> +	ret->bi_bdev	= bio->bi_bdev;
> +	ret->bi_sector	= bio->bi_sector;
> +	ret->bi_size	= sectors << 9;
> +	ret->bi_rw	= bio->bi_rw;
> +	ret->bi_vcnt	= vcnt;
> +	ret->bi_max_vecs = vcnt;
> +	ret->bi_end_io	= bio->bi_end_io;
> +	ret->bi_private	= bio->bi_private;
>  
> -		bio_endio(master, bp->error);
> -		mempool_free(bp, bp->bio2.bi_private);
> +	bio->bi_sector	+= sectors;
> +	bio->bi_size	-= sectors << 9;
> +	bio->bi_idx	 = idx;
I personally would prefer not having indentations here either.
> +	if (bio_integrity(bio)) {
> +		bio_integrity_clone(ret, bio, gfp, bs);
> +		bio_integrity_trim(ret, 0, bio_sectors(ret));
> +		bio_integrity_trim(bio, bio_sectors(ret), bio_sectors(bio));
>  	}
> +
> +	return ret;
>  }
> -EXPORT_SYMBOL(bio_pair_release);
> +EXPORT_SYMBOL_GPL(bio_split);
/** comment would be nice.
> -static void bio_pair_end_1(struct bio *bi, int err)
> +void bio_pair_release(struct bio_pair *bp)
>  {
> -	struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
> -
> -	if (err)
> -		bp->error = err;
> +	if (atomic_dec_and_test(&bp->cnt)) {
> +		bp->orig->bi_end_io = bp->bi_end_io;
> +		bp->orig->bi_private = bp->bi_private;
So, before, split wouldn't override orig->bi_private.  Now, it does so
while the bio is in flight.  I don't know.  If the only benefit of
temporary override is avoiding have a separate end_io call, I think
not doing so is better.  Also, behavior changes as subtle as this
*must* be noted in the patch description.
> -	bio_pair_release(bp);
> +		bio_endio(bp->orig, 0);
> +		bio_put(&bp->split);
> +	}
>  }
> +EXPORT_SYMBOL(bio_pair_release);
And it would be super-nice if you can add /** comment here too.
> -/*
> - * split a bio - only worry about a bio with a single page in its iovec
> - */
> -struct bio_pair *bio_split(struct bio *bi, int first_sectors)
> +struct bio_pair *bio_pair_split(struct bio *bio, int first_sectors)
>  {
> -	struct bio_pair *bp = mempool_alloc(bio_split_pool, GFP_NOIO);
> -
> -	if (!bp)
> -		return bp;
> +	struct bio_pair *bp;
> +	struct bio *split = bio_split(bio, first_sectors,
> +				      GFP_NOIO, bio_split_pool);
I know fs/bio.c isn't a bastion of good style but let's try improve it
bit by bit.  It's generally considered a bad style to put a statement
which may fail in local var declaration.  IOW, please do,
	struct bio *split;
	split = bio_split();
	if (!split)
		return NULL;
> @@ -1694,8 +1734,7 @@ static int __init init_bio(void)
>  	if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
>  		panic("bio: can't create integrity pool\n");
>  
> -	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
> -						     sizeof(struct bio_pair));
> +	bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
>  	if (!bio_split_pool)
>  		panic("bio: can't create split pool\n");
BIO_SPLIT_ENTRIES is probably something dependent on how splits can
stack, so I don't think we want to change that to BIO_POOL_SIZE.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec
       [not found]   ` <363875943e9d0e13bee6ed28239280543e6e5055.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2012-05-18 16:30     ` Tejun Heo
@ 2012-05-18 17:07     ` Boaz Harrosh
  1 sibling, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2012-05-18 17:07 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On 05/18/2012 05:59 AM, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> This is for the new bio splitting code. When we split a bio, if the
> split occured on a bvec boundry we reuse the bvec for the new bio. But
> that means bio_free() can't free it, hence the explicit flag.
> 
> Signed-off-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/bio.c                  |    3 ++-
>  include/linux/blk_types.h |    1 +
>  2 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index ecc9088..3332800 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -234,7 +234,7 @@ void bio_free(struct bio *bio, struct bio_set *bs)
>  {
>  	void *p;
>  
> -	if (bio_has_allocated_vec(bio))
> +	if (bio_flagged(bio, BIO_HAS_VEC))
>  		bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));
>  
>  	if (bio_integrity(bio))
> @@ -305,6 +305,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  			goto err_free;
>  
>  		nr_iovecs = bvec_nr_vecs(idx);
> +		bio->bi_flags |= 1 << BIO_HAS_VEC;
>  	}
>  out_set:
>  	bio->bi_flags |= idx << BIO_POOL_OFFSET;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index bd9a610..10fca21 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -101,6 +101,7 @@ struct bio {
>  #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
>  #define BIO_QUIET	10	/* Make BIO Quiet */
>  #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
> +#define BIO_HAS_VEC	12	/* bio_free() should free bvec */
Just an English nit
BIO with out any VECs is when bi_max_vecs is zero. Id prefer a name that
denotes allocation or ownership.
BIO_OWN_VEC or BIO_ALLOC_VEC or BIO_DELETE_VEC
And one technical problem.
I would prefer to negate this bit. Zero is the default, set on
the exception. Safer and also more logical.
 BIO_NO_DELETE_VEC
>  #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
>  
>  /*
Thanks
Boaz
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 11/13] Closures
       [not found]     ` <a184989cfbf92297d4cca2f823e5ac24ec8fe1e3.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 17:08       ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 17:08 UTC (permalink / raw)
  To: koverstreet-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Thu, May 17, 2012 at 10:59:58PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> Asynchronous refcounty thingies; they embed a refcount and a work
> struct. Extensive documentation follows in include/linux/closure.h
> 
> Signed-off-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
I don't think this fits in this series and the more I think about it
the more I don't like this but let's have that discussion in different
threads, so tentaive NACK.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 10/13] block: Rework bio splitting
  2012-05-18  2:59 ` [PATCH 10/13] block: Rework bio splitting koverstreet
       [not found]   ` <ac4c1cbd10934fdc3a4af74d4cb5ec370f9139d5.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 17:46   ` Boaz Harrosh
  1 sibling, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2012-05-18 17:46 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, tj, axboe,
	agk, neilb
On 05/18/2012 05:59 AM, koverstreet@google.com wrote:
> From: Kent Overstreet <koverstreet@google.com>
> 
> Break out bio_split() and bio_pair_split(), and get rid of the
> limitations of bio_split()
> 
> Signed-off-by: Kent Overstreet <koverstreet@google.com>
> ---
>  drivers/block/drbd/drbd_req.c |   18 +----
>  drivers/block/pktcdvd.c       |    6 +-
>  drivers/block/rbd.c           |    4 +-
>  drivers/md/linear.c           |    6 +-
>  drivers/md/raid0.c            |    8 +-
>  drivers/md/raid10.c           |   23 ++-----
>  fs/bio-integrity.c            |   44 ------------
>  fs/bio.c                      |  149 ++++++++++++++++++++++++++---------------
>  include/linux/bio.h           |   27 +++-----
>  9 files changed, 125 insertions(+), 160 deletions(-)
> 
> diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
> index 4a0f314..68fa494 100644
> --- a/drivers/block/drbd/drbd_req.c
> +++ b/drivers/block/drbd/drbd_req.c
> @@ -1101,18 +1101,6 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
>  	if (likely(s_enr == e_enr)) {
>  		inc_ap_bio(mdev, 1);
>  		drbd_make_request_common(mdev, bio, start_time);
> -		return;
> -	}
> -
> -	/* can this bio be split generically?
> -	 * Maybe add our own split-arbitrary-bios function. */
> -	if (bio->bi_vcnt != 1 || bio->bi_idx != 0 || bio->bi_size > DRBD_MAX_BIO_SIZE) {
> -		/* rather error out here than BUG in bio_split */
> -		dev_err(DEV, "bio would need to, but cannot, be split: "
> -		    "(vcnt=%u,idx=%u,size=%u,sector=%llu)\n",
> -		    bio->bi_vcnt, bio->bi_idx, bio->bi_size,
> -		    (unsigned long long)bio->bi_sector);
> -		bio_endio(bio, -EINVAL);
>  	} else {
>  		/* This bio crosses some boundary, so we have to split it. */
>  		struct bio_pair *bp;
> @@ -1128,7 +1116,7 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
>  		const int sps = 1 << HT_SHIFT; /* sectors per slot */
>  		const int mask = sps - 1;
>  		const sector_t first_sectors = sps - (sect & mask);
> -		bp = bio_split(bio, first_sectors);
> +		bp = bio_pair_split(bio, first_sectors);
I do not see the point of this rename. See below
>  
>  		/* we need to get a "reference count" (ap_bio_cnt)
>  		 * to avoid races with the disconnect/reconnect/suspend code.
> @@ -1139,10 +1127,10 @@ void drbd_make_request(struct request_queue *q, struct bio *bio)
>  
>  		D_ASSERT(e_enr == s_enr + 1);
>  
> -		while (drbd_make_request_common(mdev, &bp->bio1, start_time))
> +		while (drbd_make_request_common(mdev, &bp->split, start_time))
>  			inc_ap_bio(mdev, 1);
>  
> -		while (drbd_make_request_common(mdev, &bp->bio2, start_time))
> +		while (drbd_make_request_common(mdev, bio, start_time))
>  			inc_ap_bio(mdev, 1);
>  
>  		dec_ap_bio(mdev);
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index 6fe693a..1465155 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2467,10 +2467,10 @@ static void pkt_make_request(struct request_queue *q, struct bio *bio)
>  		if (last_zone != zone) {
>  			BUG_ON(last_zone != zone + pd->settings.size);
>  			first_sectors = last_zone - bio->bi_sector;
> -			bp = bio_split(bio, first_sectors);
> +			bp = bio_pair_split(bio, first_sectors);
just a rename. See below
>  			BUG_ON(!bp);
> -			pkt_make_request(q, &bp->bio1);
> -			pkt_make_request(q, &bp->bio2);
> +			pkt_make_request(q, &bp->split);
> +			pkt_make_request(q, bio);
>  			bio_pair_release(bp);
>  			return;
>  		}
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 5a953c6..dd19311 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -747,11 +747,11 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>  
>  			/* split the bio. We'll release it either in the next
>  			   call, or it will have to be released outside */
> -			bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
> +			bp = bio_pair_split(old_chain, (len - total) / SECTOR_SIZE);
just a rename. See below
>  			if (!bp)
>  				goto err_out;
>  
> -			*next = &bp->bio2;
> +			*next = &bp->split;
>  		} else
>  			*next = old_chain->bi_next;
>  
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index fa211d8..7c6cafd 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -314,10 +314,10 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
>  
>  		rcu_read_unlock();
>  
> -		bp = bio_split(bio, end_sector - bio->bi_sector);
> +		bp = bio_pair_split(bio, end_sector - bio->bi_sector);
>  
just a rename. See below
> -		linear_make_request(mddev, &bp->bio1);
> -		linear_make_request(mddev, &bp->bio2);
> +		linear_make_request(mddev, &bp->split);
> +		linear_make_request(mddev, bio);
>  		bio_pair_release(bp);
>  		return;
>  	}
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index de63a1f..3469adf 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -516,13 +516,13 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
>  		 * refuse to split for us, so we need to split it.
>  		 */
>  		if (likely(is_power_of_2(chunk_sects)))
> -			bp = bio_split(bio, chunk_sects - (sector &
> +			bp = bio_pair_split(bio, chunk_sects - (sector &
>  							   (chunk_sects-1)));
>  		else
> -			bp = bio_split(bio, chunk_sects -
> +			bp = bio_pair_split(bio, chunk_sects -
>  				       sector_div(sector, chunk_sects));
just a rename. See below
> -		raid0_make_request(mddev, &bp->bio1);
> -		raid0_make_request(mddev, &bp->bio2);
> +		raid0_make_request(mddev, &bp->split);
> +		raid0_make_request(mddev, bio);
>  		bio_pair_release(bp);
>  		return;
>  	}
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 3e7b154..0062326 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1001,15 +1001,9 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  		      > chunk_sects &&
>  		    conf->near_copies < conf->raid_disks)) {
>  		struct bio_pair *bp;
> -		/* Sanity check -- queue functions should prevent this happening */
> -		if (bio->bi_vcnt != 1 ||
> -		    bio->bi_idx != 0)
> -			goto bad_map;
> -		/* This is a one page bio that upper layers
> -		 * refuse to split for us, so we need to split it.
> -		 */
> -		bp = bio_split(bio,
> -			       chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
> +
> +		bp = bio_pair_split(bio,
> +				    chunk_sects - (bio->bi_sector & (chunk_sects - 1)));
just a rename. See below
>  
>  		/* Each of these 'make_request' calls will call 'wait_barrier'.
>  		 * If the first succeeds but the second blocks due to the resync
> @@ -1023,8 +1017,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  		conf->nr_waiting++;
>  		spin_unlock_irq(&conf->resync_lock);
>  
> -		make_request(mddev, &bp->bio1);
> -		make_request(mddev, &bp->bio2);
> +		make_request(mddev, &bp->split);
> +		make_request(mddev, bio);
>  
>  		spin_lock_irq(&conf->resync_lock);
>  		conf->nr_waiting--;
> @@ -1033,13 +1027,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  
>  		bio_pair_release(bp);
>  		return;
> -	bad_map:
> -		printk("md/raid10:%s: make_request bug: can't convert block across chunks"
> -		       " or bigger than %dk %llu %d\n", mdname(mddev), chunk_sects/2,
> -		       (unsigned long long)bio->bi_sector, bio->bi_size >> 10);
> -
> -		bio_io_error(bio);
> -		return;
>  	}
>  
>  	md_write_start(mddev, bio);
> diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
> index e85c04b..9ed2c44 100644
> --- a/fs/bio-integrity.c
> +++ b/fs/bio-integrity.c
> @@ -682,50 +682,6 @@ void bio_integrity_trim(struct bio *bio, unsigned int offset,
>  EXPORT_SYMBOL(bio_integrity_trim);
>  
>  /**
> - * bio_integrity_split - Split integrity metadata
> - * @bio:	Protected bio
> - * @bp:		Resulting bio_pair
> - * @sectors:	Offset
> - *
> - * Description: Splits an integrity page into a bio_pair.
> - */
> -void bio_integrity_split(struct bio *bio, struct bio_pair *bp, int sectors)
> -{
> -	struct blk_integrity *bi;
> -	struct bio_integrity_payload *bip = bio->bi_integrity;
> -	unsigned int nr_sectors;
> -
> -	if (bio_integrity(bio) == 0)
> -		return;
> -
> -	bi = bdev_get_integrity(bio->bi_bdev);
> -	BUG_ON(bi == NULL);
> -	BUG_ON(bip->bip_vcnt != 1);
> -
> -	nr_sectors = bio_integrity_hw_sectors(bi, sectors);
> -
> -	bp->bio1.bi_integrity = &bp->bip1;
> -	bp->bio2.bi_integrity = &bp->bip2;
> -
> -	bp->iv1 = bip->bip_vec[0];
> -	bp->iv2 = bip->bip_vec[0];
> -
> -	bp->bip1.bip_vec[0] = bp->iv1;
> -	bp->bip2.bip_vec[0] = bp->iv2;
> -
> -	bp->iv1.bv_len = sectors * bi->tuple_size;
> -	bp->iv2.bv_offset += sectors * bi->tuple_size;
> -	bp->iv2.bv_len -= sectors * bi->tuple_size;
> -
> -	bp->bip1.bip_sector = bio->bi_integrity->bip_sector;
> -	bp->bip2.bip_sector = bio->bi_integrity->bip_sector + nr_sectors;
> -
> -	bp->bip1.bip_vcnt = bp->bip2.bip_vcnt = 1;
> -	bp->bip1.bip_idx = bp->bip2.bip_idx = 0;
> -}
> -EXPORT_SYMBOL(bio_integrity_split);
> -
> -/**
>   * bio_integrity_clone - Callback for cloning bios with integrity metadata
>   * @bio:	New bio
>   * @bio_src:	Original bio
> diff --git a/fs/bio.c b/fs/bio.c
> index 3332800..781a8cf 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -35,7 +35,7 @@
>   */
>  #define BIO_INLINE_VECS		4
>  
> -static mempool_t *bio_split_pool __read_mostly;
> +static struct bio_set *bio_split_pool __read_mostly;
>  
>  /*
>   * if you change this list, also change bvec_alloc or things will
> @@ -1464,84 +1464,124 @@ void bio_endio(struct bio *bio, int error)
>  }
>  EXPORT_SYMBOL(bio_endio);
>  
> -void bio_pair_release(struct bio_pair *bp)
bio_pair_release() was just moved to down there. Which makes
a complete messy patch, unless I patch it on real code
Its hard to see what's written.
For review sake if I *must* move+change code, I split
it up by doing the pure move in a separate patch, or
sometime commenting the old code and remove it in 
a next patch.
But here I don't see the point of why you switched
them?
> +struct bio *bio_split(struct bio *bio, int sectors,
> +		      gfp_t gfp, struct bio_set *bs)
>  {
> -	if (atomic_dec_and_test(&bp->cnt)) {
> -		struct bio *master = bp->bio1.bi_private;
> +	unsigned idx, vcnt = 0, nbytes = sectors << 9;
> +	struct bio_vec *bv;
> +	struct bio *ret = NULL;
> +
> +	BUG_ON(sectors <= 0);
> +
> +	if (current->bio_list)
> +		gfp &= ~__GFP_WAIT;
> +
> +	if (nbytes >= bio->bi_size)
> +		return bio;
> +
> +	trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
> +				bio->bi_sector + sectors);
> +
> +	bio_for_each_segment(bv, bio, idx) {
> +		vcnt = idx - bio->bi_idx;
> +
> +		if (!nbytes) {
> +			ret = bio_alloc_bioset(gfp, 0, bs);
> +			if (!ret)
> +				return NULL;
> +
> +			ret->bi_io_vec = bio_iovec(bio);
> +			ret->bi_flags |= 1 << BIO_CLONED;
> +			break;
> +		} else if (nbytes < bv->bv_len) {
> +			ret = bio_alloc_bioset(gfp, ++vcnt, bs);
> +			if (!ret)
> +				return NULL;
> +
> +			memcpy(ret->bi_io_vec, bio_iovec(bio),
> +			       sizeof(struct bio_vec) * vcnt);
> +
> +			ret->bi_io_vec[vcnt - 1].bv_len = nbytes;
> +			bv->bv_offset	+= nbytes;
> +			bv->bv_len	-= nbytes;
> +			break;
> +		}
> +
> +		nbytes -= bv->bv_len;
> +	}
> +
> +	ret->bi_bdev	= bio->bi_bdev;
> +	ret->bi_sector	= bio->bi_sector;
> +	ret->bi_size	= sectors << 9;
> +	ret->bi_rw	= bio->bi_rw;
> +	ret->bi_vcnt	= vcnt;
> +	ret->bi_max_vecs = vcnt;
> +	ret->bi_end_io	= bio->bi_end_io;
> +	ret->bi_private	= bio->bi_private;
>  
> -		bio_endio(master, bp->error);
> -		mempool_free(bp, bp->bio2.bi_private);
> +	bio->bi_sector	+= sectors;
> +	bio->bi_size	-= sectors << 9;
> +	bio->bi_idx	 = idx;
> +
> +	if (bio_integrity(bio)) {
> +		bio_integrity_clone(ret, bio, gfp, bs);
> +		bio_integrity_trim(ret, 0, bio_sectors(ret));
> +		bio_integrity_trim(bio, bio_sectors(ret), bio_sectors(bio));
>  	}
> +
> +	return ret;
>  }
> -EXPORT_SYMBOL(bio_pair_release);
> +EXPORT_SYMBOL_GPL(bio_split);
>  
> -static void bio_pair_end_1(struct bio *bi, int err)
> +void bio_pair_release(struct bio_pair *bp)
>  {
> -	struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);
> -
> -	if (err)
> -		bp->error = err;
> +	if (atomic_dec_and_test(&bp->cnt)) {
> +		bp->orig->bi_end_io = bp->bi_end_io;
> +		bp->orig->bi_private = bp->bi_private;
>  
> -	bio_pair_release(bp);
> +		bio_endio(bp->orig, 0);
> +		bio_put(&bp->split);
> +	}
>  }
> +EXPORT_SYMBOL(bio_pair_release);
>  
> -static void bio_pair_end_2(struct bio *bi, int err)
> +static void bio_pair_end(struct bio *bio, int error)
>  {
> -	struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);
> +	struct bio_pair *bp = bio->bi_private;
>  
> -	if (err)
> -		bp->error = err;
> +	if (error)
> +		clear_bit(BIO_UPTODATE, &bp->orig->bi_flags);
>  
>  	bio_pair_release(bp);
>  }
>  
> -/*
> - * split a bio - only worry about a bio with a single page in its iovec
> - */
> -struct bio_pair *bio_split(struct bio *bi, int first_sectors)
> +struct bio_pair *bio_pair_split(struct bio *bio, int first_sectors)
>  {
Here too if you would have staged the function order properly that
new-split above would better merge with the old-split and the
new-old-split that uses the new-split would be a new hunk by itself.
I do this not only for reviewers, but also for myself to prove
I have not changed something I didn't mean to change.
To summarize. I think undoing code movement and better staging
of the code could make this patch much cleaner.
> -	struct bio_pair *bp = mempool_alloc(bio_split_pool, GFP_NOIO);
> -
> -	if (!bp)
> -		return bp;
> +	struct bio_pair *bp;
> +	struct bio *split = bio_split(bio, first_sectors,
> +				      GFP_NOIO, bio_split_pool);
>  
> -	trace_block_split(bdev_get_queue(bi->bi_bdev), bi,
> -				bi->bi_sector + first_sectors);
> -
> -	BUG_ON(bi->bi_vcnt != 1);
> -	BUG_ON(bi->bi_idx != 0);
> -	atomic_set(&bp->cnt, 3);
> -	bp->error = 0;
> -	bp->bio1 = *bi;
> -	bp->bio2 = *bi;
> -	bp->bio2.bi_sector += first_sectors;
> -	bp->bio2.bi_size -= first_sectors << 9;
> -	bp->bio1.bi_size = first_sectors << 9;
> +	if (!split)
> +		return NULL;
>  
> -	bp->bv1 = bi->bi_io_vec[0];
> -	bp->bv2 = bi->bi_io_vec[0];
> -	bp->bv2.bv_offset += first_sectors << 9;
> -	bp->bv2.bv_len -= first_sectors << 9;
> -	bp->bv1.bv_len = first_sectors << 9;
> +	BUG_ON(split == bio);
>  
> -	bp->bio1.bi_io_vec = &bp->bv1;
> -	bp->bio2.bi_io_vec = &bp->bv2;
> +	bp = container_of(split, struct bio_pair, split);
>  
> -	bp->bio1.bi_max_vecs = 1;
> -	bp->bio2.bi_max_vecs = 1;
> +	atomic_set(&bp->cnt, 3);
>  
> -	bp->bio1.bi_end_io = bio_pair_end_1;
> -	bp->bio2.bi_end_io = bio_pair_end_2;
> +	bp->bi_end_io = bio->bi_end_io;
> +	bp->bi_private = bio->bi_private;
>  
> -	bp->bio1.bi_private = bi;
> -	bp->bio2.bi_private = bio_split_pool;
> +	bio->bi_private = bp;
> +	bio->bi_end_io = bio_pair_end;
>  
> -	if (bio_integrity(bi))
> -		bio_integrity_split(bi, bp, first_sectors);
> +	split->bi_private = bp;
> +	split->bi_end_io = bio_pair_end;
>  
>  	return bp;
>  }
> -EXPORT_SYMBOL(bio_split);
> +EXPORT_SYMBOL(bio_pair_split);
>  
>  /**
>   *      bio_sector_offset - Find hardware sector offset in bio
> @@ -1694,8 +1734,7 @@ static int __init init_bio(void)
>  	if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
>  		panic("bio: can't create integrity pool\n");
>  
> -	bio_split_pool = mempool_create_kmalloc_pool(BIO_SPLIT_ENTRIES,
> -						     sizeof(struct bio_pair));
> +	bio_split_pool = bioset_create(BIO_POOL_SIZE, offsetof(struct bio_pair, split));
>  	if (!bio_split_pool)
>  		panic("bio: can't create split pool\n");
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 35f7c4d..db38175 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -197,16 +197,18 @@ struct bio_integrity_payload {
>   *   in bio2.bi_private
>   */
>  struct bio_pair {
> -	struct bio			bio1, bio2;
> -	struct bio_vec			bv1, bv2;
> -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> -	struct bio_integrity_payload	bip1, bip2;
> -	struct bio_vec			iv1, iv2;
> -#endif
> -	atomic_t			cnt;
> -	int				error;
> +	atomic_t		cnt;
> +
> +	bio_end_io_t		*bi_end_io;
> +	void			*bi_private;
> +
> +	struct bio		*orig;
> +	struct bio		split;
>  };
> -extern struct bio_pair *bio_split(struct bio *bi, int first_sectors);
> +
> +extern struct bio *bio_split(struct bio *bio, int sectors,
> +			     gfp_t gfp, struct bio_set *bs);
> +extern struct bio_pair *bio_pair_split(struct bio *bi, int first_sectors);
I do not get this. Please explain. We have two different topics here.
1. Introduction of a new/better split function.
2. Rename of an old one.
Now if you must, making the rename a patch of it's own is preferred.
But why must you? can't you think of a creative name for the new
one?
Though I admit your names do make sense. The first belongs to the
bio domain bio_DO, and the second belongs to the bio_pair domain
bio_pair_DO
Hey this patch and the rest are amazing. It gives one the incentive
to do some more complicating things, just because it is easier now.
Thanks!
Boaz
>  extern void bio_pair_release(struct bio_pair *dbio);
>  
>  extern struct bio_set *bioset_create(unsigned int, unsigned int);
> @@ -511,7 +513,6 @@ extern int bio_integrity_prep(struct bio *);
>  extern void bio_integrity_endio(struct bio *, int);
>  extern void bio_integrity_advance(struct bio *, unsigned int);
>  extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
> -extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
>  extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t, struct bio_set *);
>  extern int bioset_integrity_create(struct bio_set *, int);
>  extern void bioset_integrity_free(struct bio_set *);
> @@ -555,12 +556,6 @@ static inline int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
>  	return 0;
>  }
>  
> -static inline void bio_integrity_split(struct bio *bio, struct bio_pair *bp,
> -				       int sectors)
> -{
> -	return;
> -}
> -
>  static inline void bio_integrity_advance(struct bio *bio,
>  					 unsigned int bytes_done)
>  {
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
  2012-05-18  2:59 ` [PATCH 12/13] Make generic_make_request handle arbitrarily large bios koverstreet
  2012-05-18  8:05   ` NeilBrown
@ 2012-05-18 17:52   ` Tejun Heo
  2012-05-19  0:59     ` [dm-devel] " Alasdair G Kergon
       [not found]   ` <1114e7019b0055fc09a54b59b36398d5c54f5e32.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2012-05-18 17:52 UTC (permalink / raw)
  To: koverstreet
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
Hello, Kent.
On Thu, May 17, 2012 at 10:59:59PM -0400, koverstreet@google.com wrote:
> From: Kent Overstreet <koverstreet@google.com>
> 
> The way the block layer is currently written, it goes to great lengths
> to avoid having to split bios; upper layer code (such as bio_add_page())
> checks what the underlying device can handle and tries to always create
> bios that don't need to be split.
> 
> But this approach becomes unwieldy and eventually breaks down with
> stacked devices and devices with dynamic limits, and it adds a lot of
> complexity. If the block layer could split bios as needed, we could
> eliminate a lot of complexity elsewhere - particularly in stacked
> drivers. Code that creates bios can then create whatever size bios are
> convenient, and more importantly stacked drivers don't have to deal with
> both their own bio size limitations and the limitations of the
> (potentially multiple) devices underneath them.
I generally like the idea.  Device limit directly being propagated to
high level users is cumbersome.  Somebody has to be splitting anyway
and doing it at top makes things, including limit propagation through
stacked drivers, unnecessarily complicated.  Jens, what are your
thoughts?
> +static void bio_submit_split_done(struct closure *cl)
> +{
> +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> +
> +	s->bio->bi_end_io	= s->bi_end_io;
> +	s->bio->bi_private	= s->bi_private;
> +	bio_endio(s->bio, 0);
I'd rather you didn't indent assignments.
> +	closure_debug_destroy(&s->cl);
> +	mempool_free(s, s->q->bio_split_hook);
> +}
...
> +static int bio_submit_split(struct bio *bio)
bool?
> +{
> +	struct bio_split_hook *s;
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> +	if (!bio_has_data(bio) ||
> +	    !q ||
> +	    !q->bio_split_hook ||
> +	    bio_sectors(bio) <= bio_max_sectors(bio))
Style issues.
> +		return 0;
> +
> +	s = mempool_alloc(q->bio_split_hook, GFP_NOIO);
> +
> +	closure_init(&s->cl, NULL);
Please use workqueue with open coded sequencer or maybe implement bio
sequencer which can be used by other stacking drivers too.
> +	s->bio		= bio;
> +	s->q		= q;
> +	s->bi_end_io	= bio->bi_end_io;
> +	s->bi_private	= bio->bi_private;
> +
> +	bio_get(bio);
> +	bio->bi_end_io	= bio_submit_split_endio;
> +	bio->bi_private	= &s->cl;
Maybe it's okay but I *hope* bi_private override weren't necessary -
it's way too subtle.  If there's no other way and this is gonna be an
integral part of block layer, just add a field to bio.
> +unsigned __bio_max_sectors(struct bio *bio, struct block_device *bdev,
> +			   sector_t sector)
> +{
> +	unsigned ret = bio_sectors(bio);
> +	struct request_queue *q = bdev_get_queue(bdev);
> +	struct bio_vec *bv, *end = bio_iovec(bio) +
> +		min_t(int, bio_segments(bio), queue_max_segments(q));
> +
> +	struct bvec_merge_data bvm = {
> +		.bi_bdev	= bdev,
> +		.bi_sector	= sector,
> +		.bi_size	= 0,
> +		.bi_rw		= bio->bi_rw,
> +	};
> +
> +	if (bio_segments(bio) > queue_max_segments(q) ||
> +	    q->merge_bvec_fn) {
> +		ret = 0;
> +
> +		for (bv = bio_iovec(bio); bv < end; bv++) {
> +			if (q->merge_bvec_fn &&
> +			    q->merge_bvec_fn(q, &bvm, bv) < (int) bv->bv_len)
> +				break;
> +
> +			ret		+= bv->bv_len >> 9;
> +			bvm.bi_size	+= bv->bv_len;
> +		}
> +
> +		if (ret >= (BIO_MAX_PAGES * PAGE_SIZE) >> 9)
> +			return (BIO_MAX_PAGES * PAGE_SIZE) >> 9;
> +	}
> +
> +	ret = min(ret, queue_max_sectors(q));
> +
> +	WARN_ON(!ret);
> +	ret = max_t(int, ret, bio_iovec(bio)->bv_len >> 9);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(__bio_max_sectors);
Does this by any chance allow killing ->merge_bvec_fn()?  That would
be *awesome*.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 02/13] dm: kill dm_rq_bio_destructor
       [not found]         ` <20120518164319.GJ29330-FDJ95KluN3Z0klwcnFlA1dvLeJWuRmrY@public.gmane.org>
@ 2012-05-18 18:50           ` Kent Overstreet
       [not found]             ` <20120518185027.GA9673-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 18:50 UTC (permalink / raw)
  To: Tejun Heo, axboe-tSWWG44O7X1aa/9Udqfwiw,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	Junichi Nomura, Mikulas Patocka
On Fri, May 18, 2012 at 05:43:19PM +0100, Alasdair G Kergon wrote:
> On Fri, May 18, 2012 at 08:57:29AM -0700, Tejun Heo wrote:
> > Please explain why this is done and how it's safe.  Alasdair / dm
> > folks, can you please ack this?
>  
> I think it's relying on there being never more than one reference on those 
> bios so that that endio fn, called exactly once, always frees it and there
> are no dm_puts elsewhere.
Is that a safe assumption? From my perusal of the code it certainly
looks like it should be, but I don't know dm all that well.
Seems like it might be better to use bio set's front_pad to put it in
the same allocation as the bio, but I don't really want to get into that
myself.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 03/13] block: Add bio_clone_bioset()
  2012-05-18 16:11     ` [dm-devel] " Vivek Goyal
@ 2012-05-18 18:55       ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 18:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, tj, axboe,
	agk
On Fri, May 18, 2012 at 12:11:30PM -0400, Vivek Goyal wrote:
> On Thu, May 17, 2012 at 10:59:50PM -0400, koverstreet@google.com wrote:
> 
> [..]
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index a5a524e..47605e7 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -178,23 +178,7 @@ struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
> >  	if (!mddev || !mddev->bio_set)
> >  		return bio_clone(bio, gfp_mask);
> >  
> > -	b = bio_alloc_bioset(gfp_mask, bio->bi_max_vecs, mddev->bio_set);
> > -	if (!b)
> > -		return NULL;
> > -
> > -	__bio_clone(b, bio);
> > -	if (bio_integrity(bio)) {
> > -		int ret;
> > -
> > -		ret = bio_integrity_clone(b, bio, gfp_mask, mddev->bio_set);
> > -
> > -		if (ret < 0) {
> > -			bio_put(b);
> > -			return NULL;
> > -		}
> > -	}
> > -
> > -	return b;
> > +	return bio_clone_bioset(gfp_mask, bio_segments(bio), mddev->bio_set);
> 
> Are we passing wrong arguments to bio_clone_bioset()?
Uh - whoops, yes. Good catch, thanks.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 03/13] block: Add bio_clone_bioset()
  2012-05-18 16:05   ` Tejun Heo
@ 2012-05-18 20:31     ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 20:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
On Fri, May 18, 2012 at 09:05:11AM -0700, Tejun Heo wrote:
> Hey, Kent.
> 
> On Thu, May 17, 2012 at 10:59:50PM -0400, koverstreet@google.com wrote:
> > From: Kent Overstreet <koverstreet@google.com>
> > 
> > This consolidates some code, and will help in a later patch changing how
> > bio cloning works.
> > 
> > Signed-off-by: Kent Overstreet <koverstreet@google.com>
> 
> I'd prefer a bit more explanation on what's going on generally and why
> and how dm conversion is different.
Well, besides the removal of idx it's all just removing stuff that
bio_clone_bioset() does for it.
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3cc2169..3e33039 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1072,26 +1072,19 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
> >   * Creates a bio that consists of range of complete bvecs.
> >   */
> >  static struct bio *clone_bio(struct bio *bio, sector_t sector,
> > -			     unsigned short idx, unsigned short bv_count,
> > +			     unsigned short bv_count,
> >  			     unsigned int len, struct bio_set *bs)
> >  {
> >  	struct bio *clone;
> >  
> > -	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
> > -	__bio_clone(clone, bio);
> > +	clone = bio_clone_bioset(bio, GFP_NOIO, bs);
> >  	clone->bi_sector = sector;
> > -	clone->bi_idx = idx;
> > -	clone->bi_vcnt = idx + bv_count;
> > +	clone->bi_vcnt = bv_count;
> >  	clone->bi_size = to_bytes(len);
> > -	clone->bi_flags &= ~(1 << BIO_SEG_VALID);
> 
> Maybe removal of @idx deserves a separate patch?
I'm gonna back that out, I'm not sure it was correct now.
> > -	if (bio_integrity(bio)) {
> > -		bio_integrity_clone(clone, bio, GFP_NOIO, bs);
> > -
> > -		if (idx != bio->bi_idx || clone->bi_size < bio->bi_size)
> > -			bio_integrity_trim(clone,
> > -					   bio_sector_offset(bio, idx, 0), len);
> > -	}
> > +	if (bio_integrity(bio) &&
> > +	    clone->bi_size < bio->bi_size)
> 
> Unnecessary line break.
> 
> Other than that, looks good to me.
> 
> Thanks.
> 
> -- 
> tejun
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 04/13] block: Add bio_clone_kmalloc()
       [not found]       ` <20120518160903.GD19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 20:39         ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 20:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM, yehuda-L5o5AL9CYN0tUFlbccrkMA,
	sage-BnTBU8nroG7k1uMJSBkQmQ
On Fri, May 18, 2012 at 09:09:03AM -0700, Tejun Heo wrote:
> On Thu, May 17, 2012 at 10:59:51PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> > @@ -729,7 +729,7 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
> >  	}
> >  
> >  	while (old_chain && (total < len)) {
> > -		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
> > +		tmp = bio_clone_kmalloc(old_chain, gfpmask);
> >  		if (!tmp)
> >  			goto err_out;
> >  
> > @@ -751,13 +751,9 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
> >  			if (!bp)
> >  				goto err_out;
> >  
> > -			__bio_clone(tmp, &bp->bio1);
> > -
> 
> This effectively swaps the order of bio_split() and __bio_clone().  Is
> that safe?  Also, please cc the maintainer.
Now that I look at that code some more, I'm not sure it was quite right
before - that or I don't follow how it's supposed to work.
That bio_split() effectively does its own clone, so it seems to me the
bio_kmalloc()/bio_clone_kmalloc() should only be happening in the non
split case.
Also, there's a bio_chain_clone in drivers/block/osdblk.c that looks
like it's doing something similar, probably we should have some generic
code for this.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 05/13] block: Only clone bio vecs that are in use
  2012-05-18 16:13   ` Tejun Heo
@ 2012-05-18 21:14     ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 21:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
On Fri, May 18, 2012 at 09:13:34AM -0700, Tejun Heo wrote:
> On Thu, May 17, 2012 at 10:59:52PM -0400, koverstreet@google.com wrote:
> > diff --git a/fs/bio.c b/fs/bio.c
> > index e2c0970..de0733e 100644
> > --- a/fs/bio.c
> > +++ b/fs/bio.c
> > @@ -435,8 +435,9 @@ EXPORT_SYMBOL(bio_phys_segments);
> >   */
> >  void __bio_clone(struct bio *bio, struct bio *bio_src)
> >  {
> > -	memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
> > -		bio_src->bi_max_vecs * sizeof(struct bio_vec));
> > +	memcpy(bio->bi_io_vec,
> > +	       bio_iovec(bio_src),
> 
> Unnecessary line break.
> 
> > +	       bio_segments(bio_src) * sizeof(struct bio_vec));
> >  
> >  	/*
> >  	 * most users will be overriding ->bi_bdev with a new target,
> > @@ -445,10 +446,10 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
> >  	bio->bi_sector = bio_src->bi_sector;
> >  	bio->bi_bdev = bio_src->bi_bdev;
> >  	bio->bi_flags |= 1 << BIO_CLONED;
> > +	bio->bi_flags &= ~(1 << BIO_SEG_VALID);
> 
> Can probably be conditionalized on bi_idx?
I've never been that clear on the semantics of BIO_SEG_VALID. I'll defer
to you on that.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 06/13] block: Add bio_reset()
       [not found]     ` <20120518161608.GF19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 21:48       ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 21:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Fri, May 18, 2012 at 09:16:08AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 17, 2012 at 10:59:53PM -0400, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> > From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > 
> > Reusing bios is something that's been highly frowned upon in the past,
> > but driver code keeps doing it anyways. If it's going to happen anyways,
> > we should provide a generic method.
> > 
> > This'll help with getting rid of bi_destructor.
> 
> It would be nice to explain a bit why this helps bi_destructor
> removal.
drivers/block/pktcdvd.c was open coding this - it was doing a bio_init()
and then resetting bi_destructor (and other stuff). The generic
bio_reset() means it doesn't have to know about bi_destructor anymore.
> 
> > +#define BIO_RESET_OFFSET	offsetof(struct bio, bi_max_vecs)
> 
> Maybe better named BIO_RESET_BYTES?  As offset tends to indicate the
> start of a region.
Yeah, that's a better name.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec
  2012-05-18 16:30     ` Tejun Heo
@ 2012-05-18 21:49       ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 21:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, linux-fsdevel, axboe, agk,
	neilb
On Fri, May 18, 2012 at 09:30:18AM -0700, Tejun Heo wrote:
> On Thu, May 17, 2012 at 10:59:56PM -0400, koverstreet@google.com wrote:
> > From: Kent Overstreet <koverstreet@google.com>
> > 
> > This is for the new bio splitting code. When we split a bio, if the
> > split occured on a bvec boundry we reuse the bvec for the new bio. But
> > that means bio_free() can't free it, hence the explicit flag.
> > 
> > Signed-off-by: Kent Overstreet <koverstreet@google.com>
> > ---
> >  fs/bio.c                  |    3 ++-
> >  include/linux/blk_types.h |    1 +
> >  2 files changed, 3 insertions(+), 1 deletions(-)
> > 
> > diff --git a/fs/bio.c b/fs/bio.c
> > index ecc9088..3332800 100644
> > --- a/fs/bio.c
> > +++ b/fs/bio.c
> > @@ -234,7 +234,7 @@ void bio_free(struct bio *bio, struct bio_set *bs)
> >  {
> >  	void *p;
> >  
> > -	if (bio_has_allocated_vec(bio))
> > +	if (bio_flagged(bio, BIO_HAS_VEC))
> >  		bvec_free_bs(bs, bio->bi_io_vec, BIO_POOL_IDX(bio));
> 
> We can remove bio_has_allocated_vec()?
Yep! Adding that to the patch.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH 08/13] block: Kill bi_destructor
       [not found]       ` <20120518162142.GH19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 22:21         ` Kent Overstreet
  0 siblings, 0 replies; 48+ messages in thread
From: Kent Overstreet @ 2012-05-18 22:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	axboe-tSWWG44O7X1aa/9Udqfwiw, agk-H+wXaHxf7aLQT0dZR+AlfA,
	neilb-l3A5Bk7waGM
On Fri, May 18, 2012 at 09:21:42AM -0700, Tejun Heo wrote:
> Hello,
> 
> > @@ -417,8 +409,11 @@ void bio_put(struct bio *bio)
> >  
> >  		if (bio->bi_pool)
> >  			bio_free(bio, bio->bi_pool);
> > -		else
> > -			bio->bi_destructor(bio);
> > +		else {
> > +			if (bio_integrity(bio))
> > +				bio_integrity_free(bio, fs_bio_set);
> > +			kfree(bio);
> 
> if {
> } else {
> }
> 
> And wouldn't it be better to make bio_free() handle kfreeing too?
That'd kind of change the semantics of bio_free() - but I suppose it'd
make sense.
I'm not terribly happy with how the bio integrity data belongs to
fs_bio_set for kmalloced bios. Maybe I'll change that at some point.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
       [not found]   ` <1114e7019b0055fc09a54b59b36398d5c54f5e32.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-18 22:48     ` Mikulas Patocka
  0 siblings, 0 replies; 48+ messages in thread
From: Mikulas Patocka @ 2012-05-18 22:48 UTC (permalink / raw)
  To: device-mapper development
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	axboe-tSWWG44O7X1aa/9Udqfwiw, Kent Overstreet,
	agk-H+wXaHxf7aLQT0dZR+AlfA
Hi
This patch looks very good, this approach will hopefully fix the long 
standing bug - when you add usb disk to dm-mirro or md-raid-1, requests 
would fail because the usb disk has different limits.
bio_segments returns the number of pages (not segments) in the bio. And 
this is incorrectly compared to queue_max_segments. It should compare 
bio_phys_segments to queue_max_segments.
If you have this split infrastructure, it would be good if you allow to 
use it by other drivers - I'd suggest that you change q->make_request_fn 
to return 0 on success (bio submitted) and a number of sectors on failure 
(this means that the driver only accepts the specified number of sector 
and --- generic_make_request needs to split the bio and submit a smaller 
bio with the specified number of sectors). If you do it this way, md and 
dm can be simplified - they will no longer need to do their own request 
splitting and they could just return a number of sectors they accept.
What I mean:
generic_make_request calls q->make_request_fn(q, bio). If the return value 
is zero, continue with the next bio. If the return value is nonzero, split 
the current bio so that the first part of the bio has the number of 
sectors specified in the return value. Submit the partial bio (normally it 
should be accepted, but in the extreme case when someone changed dm 
mapping underneath, we would need to split it again) and the remainder.
The check bio_sectors(bio) <= bio_max_sectors(bio) can then be moved to 
blk_queue_bio.
This way, we could remove bio splitting from dm and md and simplify them.
Mikulas
On Thu, 17 May 2012, koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org wrote:
> From: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> The way the block layer is currently written, it goes to great lengths
> to avoid having to split bios; upper layer code (such as bio_add_page())
> checks what the underlying device can handle and tries to always create
> bios that don't need to be split.
> 
> But this approach becomes unwieldy and eventually breaks down with
> stacked devices and devices with dynamic limits, and it adds a lot of
> complexity. If the block layer could split bios as needed, we could
> eliminate a lot of complexity elsewhere - particularly in stacked
> drivers. Code that creates bios can then create whatever size bios are
> convenient, and more importantly stacked drivers don't have to deal with
> both their own bio size limitations and the limitations of the
> (potentially multiple) devices underneath them.
> 
> Signed-off-by: Kent Overstreet <koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  block/blk-core.c       |  111 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/bio.c               |   41 ++++++++++++++++++
>  include/linux/bio.h    |    7 +++
>  include/linux/blkdev.h |    3 +
>  4 files changed, 162 insertions(+), 0 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 91617eb..1d7a482 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -29,6 +29,7 @@
>  #include <linux/fault-inject.h>
>  #include <linux/list_sort.h>
>  #include <linux/delay.h>
> +#include <linux/closure.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/block.h>
> @@ -52,6 +53,12 @@ static struct kmem_cache *request_cachep;
>  struct kmem_cache *blk_requestq_cachep;
>  
>  /*
> + * For bio_split_hook
> + */
> +static struct kmem_cache *bio_split_cache;
> +static struct workqueue_struct *bio_split_wq;
> +
> +/*
>   * Controlling structure to kblockd
>   */
>  static struct workqueue_struct *kblockd_workqueue;
> @@ -487,6 +494,14 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  	if (q->id < 0)
>  		goto fail_q;
>  
> +	q->bio_split_hook = mempool_create_slab_pool(4, bio_split_cache);
> +	if (!q->bio_split_hook)
> +		goto fail_split_hook;
> +
> +	q->bio_split = bioset_create(4, 0);
> +	if (!q->bio_split)
> +		goto fail_split;
> +
>  	q->backing_dev_info.ra_pages =
>  			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
>  	q->backing_dev_info.state = 0;
> @@ -526,6 +541,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  
>  fail_id:
>  	ida_simple_remove(&blk_queue_ida, q->id);
> +fail_split:
> +	bioset_free(q->bio_split);
> +fail_split_hook:
> +	mempool_destroy(q->bio_split_hook);
>  fail_q:
>  	kmem_cache_free(blk_requestq_cachep, q);
>  	return NULL;
> @@ -1493,6 +1512,83 @@ static inline bool should_fail_request(struct hd_struct *part,
>  
>  #endif /* CONFIG_FAIL_MAKE_REQUEST */
>  
> +struct bio_split_hook {
> +	struct closure		cl;
> +	struct request_queue	*q;
> +	struct bio		*bio;
> +	bio_end_io_t		*bi_end_io;
> +	void			*bi_private;
> +};
> +
> +static void bio_submit_split_done(struct closure *cl)
> +{
> +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> +
> +	s->bio->bi_end_io	= s->bi_end_io;
> +	s->bio->bi_private	= s->bi_private;
> +	bio_endio(s->bio, 0);
> +
> +	closure_debug_destroy(&s->cl);
> +	mempool_free(s, s->q->bio_split_hook);
> +}
> +
> +static void bio_submit_split_endio(struct bio *bio, int error)
> +{
> +	struct closure *cl = bio->bi_private;
> +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> +
> +	if (error)
> +		clear_bit(BIO_UPTODATE, &s->bio->bi_flags);
> +
> +	bio_put(bio);
> +	closure_put(cl);
> +}
> +
> +static void __bio_submit_split(struct closure *cl)
> +{
> +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> +	struct bio *bio = s->bio, *n;
> +
> +	do {
> +		n = bio_split(bio, bio_max_sectors(bio),
> +			      GFP_NOIO, s->q->bio_split);
> +		if (!n)
> +			continue_at(cl, __bio_submit_split, bio_split_wq);
> +
> +		closure_get(cl);
> +		generic_make_request(n);
> +	} while (n != bio);
> +
> +	continue_at(cl, bio_submit_split_done, NULL);
> +}
> +
> +static int bio_submit_split(struct bio *bio)
> +{
> +	struct bio_split_hook *s;
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> +	if (!bio_has_data(bio) ||
> +	    !q ||
> +	    !q->bio_split_hook ||
> +	    bio_sectors(bio) <= bio_max_sectors(bio))
> +		return 0;
> +
> +	s = mempool_alloc(q->bio_split_hook, GFP_NOIO);
> +
> +	closure_init(&s->cl, NULL);
> +	s->bio		= bio;
> +	s->q		= q;
> +	s->bi_end_io	= bio->bi_end_io;
> +	s->bi_private	= bio->bi_private;
> +
> +	bio_get(bio);
> +	bio->bi_end_io	= bio_submit_split_endio;
> +	bio->bi_private	= &s->cl;
> +
> +	__bio_submit_split(&s->cl);
> +	return 1;
> +}
> +
>  /*
>   * Check whether this bio extends beyond the end of the device.
>   */
> @@ -1646,6 +1742,14 @@ void generic_make_request(struct bio *bio)
>  	 * it is non-NULL, then a make_request is active, and new requests
>  	 * should be added at the tail
>  	 */
> +
> +	/*
> +	 * If the device can't accept arbitrary sized bios, check if we
> +	 * need to split:
> +	 */
> +	if (bio_submit_split(bio))
> +		return;
> +
>  	if (current->bio_list) {
>  		bio_list_add(current->bio_list, bio);
>  		return;
> @@ -2892,11 +2996,18 @@ int __init blk_dev_init(void)
>  	if (!kblockd_workqueue)
>  		panic("Failed to create kblockd\n");
>  
> +	bio_split_wq = alloc_workqueue("bio_split", WQ_MEM_RECLAIM, 0);
> +	if (!bio_split_wq)
> +		panic("Failed to create bio_split wq\n");
> +
>  	request_cachep = kmem_cache_create("blkdev_requests",
>  			sizeof(struct request), 0, SLAB_PANIC, NULL);
>  
>  	blk_requestq_cachep = kmem_cache_create("blkdev_queue",
>  			sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
>  
> +	bio_split_cache = kmem_cache_create("bio_split_hook",
> +			sizeof(struct bio_split_hook), 0, SLAB_PANIC, NULL);
> +
>  	return 0;
>  }
> diff --git a/fs/bio.c b/fs/bio.c
> index 781a8cf..360ac93 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -428,6 +428,47 @@ inline int bio_phys_segments(struct request_queue *q, struct bio *bio)
>  }
>  EXPORT_SYMBOL(bio_phys_segments);
>  
> +unsigned __bio_max_sectors(struct bio *bio, struct block_device *bdev,
> +			   sector_t sector)
> +{
> +	unsigned ret = bio_sectors(bio);
> +	struct request_queue *q = bdev_get_queue(bdev);
> +	struct bio_vec *bv, *end = bio_iovec(bio) +
> +		min_t(int, bio_segments(bio), queue_max_segments(q));
> +
> +	struct bvec_merge_data bvm = {
> +		.bi_bdev	= bdev,
> +		.bi_sector	= sector,
> +		.bi_size	= 0,
> +		.bi_rw		= bio->bi_rw,
> +	};
> +
> +	if (bio_segments(bio) > queue_max_segments(q) ||
> +	    q->merge_bvec_fn) {
> +		ret = 0;
> +
> +		for (bv = bio_iovec(bio); bv < end; bv++) {
> +			if (q->merge_bvec_fn &&
> +			    q->merge_bvec_fn(q, &bvm, bv) < (int) bv->bv_len)
> +				break;
> +
> +			ret		+= bv->bv_len >> 9;
> +			bvm.bi_size	+= bv->bv_len;
> +		}
> +
> +		if (ret >= (BIO_MAX_PAGES * PAGE_SIZE) >> 9)
> +			return (BIO_MAX_PAGES * PAGE_SIZE) >> 9;
> +	}
> +
> +	ret = min(ret, queue_max_sectors(q));
> +
> +	WARN_ON(!ret);
> +	ret = max_t(int, ret, bio_iovec(bio)->bv_len >> 9);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(__bio_max_sectors);
> +
>  /**
>   * 	__bio_clone	-	clone a bio
>   * 	@bio: destination bio
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index db38175..79f8173 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -224,6 +224,13 @@ extern void bio_endio(struct bio *, int);
>  struct request_queue;
>  extern int bio_phys_segments(struct request_queue *, struct bio *);
>  
> +unsigned __bio_max_sectors(struct bio *, struct block_device *, sector_t);
> +
> +static inline unsigned bio_max_sectors(struct bio *bio)
> +{
> +	return __bio_max_sectors(bio, bio->bi_bdev, bio->bi_sector);
> +}
> +
>  extern void __bio_clone(struct bio *, struct bio *);
>  extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
>  extern struct bio *bio_clone(struct bio *, gfp_t);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2aa2466..464adb7 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -399,6 +399,9 @@ struct request_queue {
>  	/* Throttle data */
>  	struct throtl_data *td;
>  #endif
> +
> +	mempool_t		*bio_split_hook;
> +	struct bio_set		*bio_split;
>  };
>  
>  #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
> -- 
> 1.7.9.rc2
> 
> --
> dm-devel mailing list
> dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
  2012-05-18 17:52   ` Tejun Heo
@ 2012-05-19  0:59     ` Alasdair G Kergon
  0 siblings, 0 replies; 48+ messages in thread
From: Alasdair G Kergon @ 2012-05-19  0:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: koverstreet, axboe, dm-devel, linux-kernel, linux-bcache,
	linux-fsdevel, agk
On Fri, May 18, 2012 at 10:52:16AM -0700, Tejun Heo wrote:
> Does this by any chance allow killing ->merge_bvec_fn()?
merge_bvec also has an impact on efficiency.
Why create a too-large bio in the first place if it's only going to have to
get split up again?
If a change like this goes in, the various trade-offs should be re-examined.
Alasdair
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
       [not found]         ` <20120518081444.GA27205-RcKxWJ4Cfj3IzGYXcIpNmNLIRw13R84JkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
@ 2012-05-21 17:17           ` Vivek Goyal
  2012-05-21 17:55             ` Kent Overstreet
  0 siblings, 1 reply; 48+ messages in thread
From: Vivek Goyal @ 2012-05-21 17:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, axboe-tSWWG44O7X1aa/9Udqfwiw,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, agk-H+wXaHxf7aLQT0dZR+AlfA
On Fri, May 18, 2012 at 04:14:44AM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2012 at 06:05:50PM +1000, NeilBrown wrote:
> > 
> > Hi Kent,
> >  there is lots of good stuff in this series and I certainly hope a lot of it
> >  can eventually go upstream.
> > 
> >  However there is a part of this patch that I don't think is safe:
> > 
> > 
> > > +static void __bio_submit_split(struct closure *cl)
> > > +{
> > > +	struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
> > > +	struct bio *bio = s->bio, *n;
> > > +
> > > +	do {
> > > +		n = bio_split(bio, bio_max_sectors(bio),
> > > +			      GFP_NOIO, s->q->bio_split);
> > > +		if (!n)
> > > +			continue_at(cl, __bio_submit_split, bio_split_wq);
> > > +
> > > +		closure_get(cl);
> > > +		generic_make_request(n);
> > > +	} while (n != bio);
> > > +
> > > +	continue_at(cl, bio_submit_split_done, NULL);
> > > +}
> > 
> > Firstly a small point:  Can bio_split ever return NULL here?  I don't
> > think it can, so there is no need to test.
> > But if it can, then calling generic_make_request(NULL) doesn't seem like a
> > good idea.
> > 
> > More significantly though::
> >   This is called from generic_make_request which can be called recursively and
> > enforces a tail-recursion semantic.
> > If that generic_make_request was a recursive call, then the
> > generic_make_request in __bio_submit_split will not start the request, but
> > will queue the bio for later handling.  If we then call bio_split again, we
> > could try to allocation from a mempool while we are holding one entry
> > allocated from that pool captive.  This can deadlock.
> > 
> > i.e. if the original bio is so large that it needs to be split into 3 pieces,
> > then we will try to allocate the second piece before the first piece has a
> > chance to be released.  If this happens in enough threads to exhaust the pool
> > (4 I think), it will deadlock.
> > 
> > I realise this sounds like a very unlikely case, but of course they happen.
> > 
> > One possible approach might be to count how many splits will be required,
> > then have an interface to mempools so you can allocate them all at once,
> > or block and wait.
> 
> Yeah, I actually thought of that (I probably should've documented it,
> though).
> 
> That's why the code is checking if bio_split() returned NULL; my verison
> of bio_split() checks if current->bio_list is non NULL, and if it is it
> doesn't pass __GFP_WAIT to bio_alloc_bioset().
> 
> (I was debating whether bio_split() should do that itself or leave it up
> to the caller. I tend to think it's too easy to accidentally screw up -
> and not notice it - so it should be enforced by generic code. Possibly
> the check should be moved to bio_alloc_bioset(), though.)
> 
> If we do get a memory allocation failure, then we just punt to workqueue
> - continue_at() runs __bio_submit_split from the bio_split workqueue -
> where we can safely use the mempool.
May be I am missing something, hence I will ask.  Is punting to workqueue
will really solve the issue raised by Neil. 
Due to spliting required, we will be holding some bios in the stack and
these bios can't be submitted till further allocation from pool happens. So
will it matter whether we are waiting for allocation in submitting process
context or in worker thread context.
IOW, say you have a pool of 2 bios. We allocate 1 bio (say bio A),  and submit
it for IO (now 1 bio left in pool). Now, bio A needs to be split up, so we
allocate bio B and submit it (pool is empty now). Now we try to submit bio
B and this also needs to be split. There are no more free bios so we will
wait for some to get free but none of the bios (A and B) have actually
been submitted for IO so nothing will get freed and we have a deadlock
(This is assuming that memory is tight enough that we are not able to do
any allocations from the slab backing the mempool).
biodoc.txt says following.
**************************************************************************
The caller of bio_alloc is expected to taken certain steps to avoid
deadlocks, e.g. avoid trying to allocate more memory from the pool while
already holding memory obtained from the pool.
[TBD: This is a potential issue, though a rare possibility
 in the bounce bio allocation that happens in the current code, since
 it ends up allocating a second bio from the same pool while
 holding the original bio ]
**************************************************************************
Thanks
Vivek
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
  2012-05-21 17:17           ` [dm-devel] " Vivek Goyal
@ 2012-05-21 17:55             ` Kent Overstreet
  2012-05-21 18:32               ` Vivek Goyal
  0 siblings, 1 reply; 48+ messages in thread
From: Kent Overstreet @ 2012-05-21 17:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: NeilBrown, axboe, dm-devel, linux-kernel, tj, linux-bcache,
	linux-fsdevel, agk
On Mon, May 21, 2012 at 01:17:06PM -0400, Vivek Goyal wrote:
> May be I am missing something, hence I will ask.  Is punting to workqueue
> will really solve the issue raised by Neil. 
> 
> Due to spliting required, we will be holding some bios in the stack and
> these bios can't be submitted till further allocation from pool happens. So
> will it matter whether we are waiting for allocation in submitting process
> context or in worker thread context.
Punting to workqueue allows all the bio splits that have already been
allocated to be submitted.
> IOW, say you have a pool of 2 bios. We allocate 1 bio (say bio A),  and submit
> it for IO (now 1 bio left in pool). Now, bio A needs to be split up, so we
> allocate bio B and submit it (pool is empty now). Now we try to submit bio
> B and this also needs to be split. There are no more free bios so we will
> wait for some to get free but none of the bios (A and B) have actually
> been submitted for IO so nothing will get freed and we have a deadlock
> (This is assuming that memory is tight enough that we are not able to do
> any allocations from the slab backing the mempool).
You're talking about a different issue. You can't safely split a split
from the same bio pool - but this code doesn't do that since it adds a
separate bio pool for each request_queue that's only for splits.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 12/13] Make generic_make_request handle arbitrarily large bios
  2012-05-21 17:55             ` Kent Overstreet
@ 2012-05-21 18:32               ` Vivek Goyal
  0 siblings, 0 replies; 48+ messages in thread
From: Vivek Goyal @ 2012-05-21 18:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: NeilBrown, axboe, dm-devel, linux-kernel, tj, linux-bcache,
	linux-fsdevel, agk
On Mon, May 21, 2012 at 10:55:42AM -0700, Kent Overstreet wrote:
> On Mon, May 21, 2012 at 01:17:06PM -0400, Vivek Goyal wrote:
> > May be I am missing something, hence I will ask.  Is punting to workqueue
> > will really solve the issue raised by Neil. 
> > 
> > Due to spliting required, we will be holding some bios in the stack and
> > these bios can't be submitted till further allocation from pool happens. So
> > will it matter whether we are waiting for allocation in submitting process
> > context or in worker thread context.
> 
> Punting to workqueue allows all the bio splits that have already been
> allocated to be submitted.
>
> > IOW, say you have a pool of 2 bios. We allocate 1 bio (say bio A),  and submit
> > it for IO (now 1 bio left in pool). Now, bio A needs to be split up, so we
> > allocate bio B and submit it (pool is empty now). Now we try to submit bio
> > B and this also needs to be split. There are no more free bios so we will
> > wait for some to get free but none of the bios (A and B) have actually
> > been submitted for IO so nothing will get freed and we have a deadlock
> > (This is assuming that memory is tight enough that we are not able to do
> > any allocations from the slab backing the mempool).
> 
> You're talking about a different issue. You can't safely split a split
> from the same bio pool - but this code doesn't do that since it adds a
> separate bio pool for each request_queue that's only for splits.
Ok, I missed the fact that there are per queue pools for bio allocation.
Thanks for clarifying.
Vivek
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [dm-devel] [PATCH 02/13] dm: kill dm_rq_bio_destructor
       [not found]             ` <20120518185027.GA9673-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-05-22  4:29               ` Jun'ichi Nomura
  0 siblings, 0 replies; 48+ messages in thread
From: Jun'ichi Nomura @ 2012-05-22  4:29 UTC (permalink / raw)
  To: device-mapper development, Kent Overstreet
  Cc: Tejun Heo, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	Mikulas Patocka
Hi,
On 05/19/12 03:50, Kent Overstreet wrote:
> On Fri, May 18, 2012 at 05:43:19PM +0100, Alasdair G Kergon wrote:
>> On Fri, May 18, 2012 at 08:57:29AM -0700, Tejun Heo wrote:
>>> Please explain why this is done and how it's safe.  Alasdair / dm
>>> folks, can you please ack this?
>>  
>> I think it's relying on there being never more than one reference on those 
>> bios so that that endio fn, called exactly once, always frees it and there
>> are no dm_puts elsewhere.
> 
> Is that a safe assumption? From my perusal of the code it certainly
> looks like it should be, but I don't know dm all that well.
Doing free_bio_info() in end_clone_bio() is safe.
But there is other problem.
This bio may be put by blk_rq_unprep_clone() and leak memory
without the destructor.
So could it be possible to keep bi_destructor available for this case?
Thanks,
-- 
Jun'ichi Nomura, NEC Corporation
^ permalink raw reply	[flat|nested] 48+ messages in thread
end of thread, other threads:[~2012-05-22  4:29 UTC | newest]
Thread overview: 48+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-18  2:59 [PATCH 00/13] Block cleanups (for bcache) koverstreet
2012-05-18  2:59 ` [PATCH 01/13] block: Generalized bio pool freeing koverstreet
     [not found]   ` <ea774fea2a27c9f1028a12ce31a7ee5e5517bef4.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 15:55     ` Tejun Heo
     [not found]       ` <20120518155538.GA19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:14         ` [dm-devel] " Alasdair G Kergon
2012-05-18  2:59 ` [PATCH 02/13] dm: kill dm_rq_bio_destructor koverstreet
2012-05-18 15:57   ` Tejun Heo
     [not found]     ` <20120518155729.GB19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:43       ` [dm-devel] " Alasdair G Kergon
     [not found]         ` <20120518164319.GJ29330-FDJ95KluN3Z0klwcnFlA1dvLeJWuRmrY@public.gmane.org>
2012-05-18 18:50           ` Kent Overstreet
     [not found]             ` <20120518185027.GA9673-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-22  4:29               ` Jun'ichi Nomura
2012-05-18  2:59 ` [PATCH 03/13] block: Add bio_clone_bioset() koverstreet
2012-05-18 16:05   ` Tejun Heo
2012-05-18 20:31     ` Kent Overstreet
     [not found]   ` <eb6a7d3fe7ae203202bc365d7274ee531631a9ca.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:11     ` [dm-devel] " Vivek Goyal
2012-05-18 18:55       ` Kent Overstreet
2012-05-18  2:59 ` [PATCH 04/13] block: Add bio_clone_kmalloc() koverstreet
     [not found]   ` <1c7c2d4b89bc3d0e907608cec37bcf0ee50f4c0e.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:09     ` Tejun Heo
     [not found]       ` <20120518160903.GD19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 20:39         ` Kent Overstreet
2012-05-18 16:45   ` Boaz Harrosh
2012-05-18  2:59 ` [PATCH 05/13] block: Only clone bio vecs that are in use koverstreet
2012-05-18 16:13   ` Tejun Heo
2012-05-18 21:14     ` Kent Overstreet
2012-05-18  2:59 ` [PATCH 06/13] block: Add bio_reset() koverstreet
2012-05-18 16:16   ` Tejun Heo
     [not found]     ` <20120518161608.GF19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 21:48       ` Kent Overstreet
2012-05-18  2:59 ` [PATCH 07/13] pktcdvd: Switch to bio_kmalloc() koverstreet
     [not found]   ` <04a4d6c2c8b6f0097e3594c6e0932093afffc1da.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:18     ` Tejun Heo
2012-05-18  2:59 ` [PATCH 08/13] block: Kill bi_destructor koverstreet
     [not found]   ` <de05855cf0fa4800cfab7e8340f106dccc7a75a1.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:21     ` Tejun Heo
     [not found]       ` <20120518162142.GH19388-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 22:21         ` Kent Overstreet
2012-05-18  2:59 ` [PATCH 09/13] block: Add an explicit bio flag for bios that own their bvec koverstreet
     [not found]   ` <363875943e9d0e13bee6ed28239280543e6e5055.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 16:30     ` Tejun Heo
2012-05-18 21:49       ` Kent Overstreet
2012-05-18 17:07     ` Boaz Harrosh
2012-05-18  2:59 ` [PATCH 10/13] block: Rework bio splitting koverstreet
     [not found]   ` <ac4c1cbd10934fdc3a4af74d4cb5ec370f9139d5.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 17:07     ` Tejun Heo
2012-05-18 17:46   ` Boaz Harrosh
     [not found] ` <cover.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18  2:59   ` [PATCH 11/13] Closures koverstreet-hpIqsD4AKlfQT0dZR+AlfA
     [not found]     ` <a184989cfbf92297d4cca2f823e5ac24ec8fe1e3.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 17:08       ` Tejun Heo
2012-05-18  2:59 ` [PATCH 12/13] Make generic_make_request handle arbitrarily large bios koverstreet
2012-05-18  8:05   ` NeilBrown
     [not found]     ` <20120518180550.0a6cdc34-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2012-05-18  8:14       ` Kent Overstreet
     [not found]         ` <20120518081444.GA27205-RcKxWJ4Cfj3IzGYXcIpNmNLIRw13R84JkQQo+JxHRPFibQn6LdNjmg@public.gmane.org>
2012-05-21 17:17           ` [dm-devel] " Vivek Goyal
2012-05-21 17:55             ` Kent Overstreet
2012-05-21 18:32               ` Vivek Goyal
2012-05-18 17:52   ` Tejun Heo
2012-05-19  0:59     ` [dm-devel] " Alasdair G Kergon
     [not found]   ` <1114e7019b0055fc09a54b59b36398d5c54f5e32.1337308722.git.koverstreet-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2012-05-18 22:48     ` Mikulas Patocka
2012-05-18  3:00 ` [PATCH 13/13] Gut bio_add_page() koverstreet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).