Linux RAID subsystem development
 help / color / mirror / Atom feed
* [PATCH] md: fix incorrect use of lexx_to_cpu in does_sb_need_changing
From: Jason Yan @ 2017-03-10  3:49 UTC (permalink / raw)
  To: shli, linux-raid; +Cc: miaoxie, zhaohongjiang, Jason Yan

The sb->layout is of type __le32, so we shoud use le32_to_cpu.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e0d4d13..ea86a89 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2287,7 +2287,7 @@ static bool does_sb_need_changing(struct mddev *mddev)
 	/* Check if any mddev parameters have changed */
 	if ((mddev->dev_sectors != le64_to_cpu(sb->size)) ||
 	    (mddev->reshape_position != le64_to_cpu(sb->reshape_position)) ||
-	    (mddev->layout != le64_to_cpu(sb->layout)) ||
+	    (mddev->layout != le32_to_cpu(sb->layout)) ||
 	    (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) ||
 	    (mddev->chunk_sectors != le32_to_cpu(sb->chunksize)))
 		return true;
-- 
2.5.0


^ permalink raw reply related

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: NeilBrown @ 2017-03-10  4:32 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87r328j00i.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 2343 bytes --]


I started looking further at the improvements we can make once
generic_make_request is fixed, and realised that I had missed an
important detail in this patch.
Several places test current->bio_list, and two actually edit the list.
With this change, that cannot see the whole lists, so it could cause a
regression.

So I've revised the patch to make sure that the entire list of queued
bios remains visible, and change the relevant code to look at both
the new list and the old list.

Following that there are some patches which make the rescuer thread
optional, and then starts removing it from some easy-to-fix places.

The series summary is below.

NeilBrown


NeilBrown (5):
      blk: improve order of bio handling in generic_make_request()
      blk: remove bio_set arg from blk_queue_split()
      blk: make the bioset rescue_workqueue optional.
      blk: use non-rescuing bioset for q->bio_split.
      block_dev: make blkdev_dio_pool a non-rescuing bioset


 block/bio.c                         |   39 +++++++++++++++++++++++++++------
 block/blk-core.c                    |   42 ++++++++++++++++++++++++++++-------
 block/blk-merge.c                   |    7 +++---
 block/blk-mq.c                      |    4 ++-
 drivers/block/drbd/drbd_main.c      |    2 +-
 drivers/block/drbd/drbd_req.c       |    2 +-
 drivers/block/pktcdvd.c             |    2 +-
 drivers/block/ps3vram.c             |    2 +-
 drivers/block/rsxx/dev.c            |    2 +-
 drivers/block/umem.c                |    2 +-
 drivers/block/zram/zram_drv.c       |    2 +-
 drivers/lightnvm/rrpc.c             |    2 +-
 drivers/md/bcache/super.c           |    4 ++-
 drivers/md/dm-crypt.c               |    2 +-
 drivers/md/dm-io.c                  |    2 +-
 drivers/md/dm.c                     |   32 +++++++++++++++------------
 drivers/md/md.c                     |    4 ++-
 drivers/md/raid10.c                 |    3 ++-
 drivers/md/raid5-cache.c            |    2 +-
 drivers/s390/block/dcssblk.c        |    2 +-
 drivers/s390/block/xpram.c          |    2 +-
 drivers/target/target_core_iblock.c |    2 +-
 fs/btrfs/extent_io.c                |    4 ++-
 fs/xfs/xfs_super.c                  |    2 +-
 include/linux/bio.h                 |    2 ++
 include/linux/blkdev.h              |    3 +--
 26 files changed, 114 insertions(+), 60 deletions(-)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH 1/5 v3] blk: improve order of bio handling in generic_make_request()
From: NeilBrown @ 2017-03-10  4:33 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87d1dphhuy.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 9493 bytes --]


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling.  They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled seqeuntially.  If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete.  This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device.  Both md and dm have examples where this happens.

These deadlocks can be erradicated by more selective ordering of bios.
Specifically by handling them in depth-first order.  That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent.  That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submited requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue.  However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn.  After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, it not enough to remove all deadlocks.  It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request.  This includes never
allocing from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return.  The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off.  If it splits again, the same process happens.  In
each case one bio will be completely handled before the next one is attempted.

With this is place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Note that as some drivers look inside the bio_list, sometimes to punt
some bios to rescuer threads, we need to make both the pendind list and the
new list visible.  So current->bio_list is now an array of 2 lists,
and relevant code examines both of them.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com>
Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c         |   11 ++++++++---
 block/blk-core.c    |   40 ++++++++++++++++++++++++++++++++--------
 drivers/md/dm.c     |   29 ++++++++++++++++-------------
 drivers/md/raid10.c |    3 ++-
 4 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..84ae39f06f81 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -376,10 +376,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	bio_list_init(&punt);
 	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	while ((bio = bio_list_pop(&current->bio_list[0])))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[0] = nopunt;
 
-	*current->bio_list = nopunt;
+	while ((bio = bio_list_pop(&current->bio_list[1])))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[1] = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -466,7 +469,9 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current->bio_list &&
+		    (!bio_list_empty(&current->bio_list[0]) ||
+		     !bio_list_empty(&current->bio_list[1])))
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 1086dac8724c..bd2cb4ba674e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	/*
+	 * bio_list_on_stack[0] contains bios submitted by the current
+	 * make_request_fn.
+	 * bio_list_on_stack[1] contains bios that were submitted before
+	 * the current make_request_fn, but that haven't been processed
+	 * yet.
+	 */
+	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -1992,7 +1999,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		bio_list_add(&current->bio_list[0], bio);
 		goto out;
 	}
 
@@ -2011,23 +2018,40 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_list_on_stack[0]);
+	bio_list_init(&bio_list_on_stack[1]);
+	current->bio_list = bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_on_stack[1] = bio_list_on_stack[0];
+			bio_list_init(&bio_list_on_stack[0]);
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack[0], &lower);
+			bio_list_merge(&bio_list_on_stack[0], &same);
+			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..dfb75979e455 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -989,26 +989,29 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 	struct dm_offload *o = container_of(cb, struct dm_offload, cb);
 	struct bio_list list;
 	struct bio *bio;
+	int i;
 
 	INIT_LIST_HEAD(&o->cb.list);
 
 	if (unlikely(!current->bio_list))
 		return;
 
-	list = *current->bio_list;
-	bio_list_init(current->bio_list);
-
-	while ((bio = bio_list_pop(&list))) {
-		struct bio_set *bs = bio->bi_pool;
-		if (unlikely(!bs) || bs == fs_bio_set) {
-			bio_list_add(current->bio_list, bio);
-			continue;
+	for (i = 0; i < 2; i++) {
+		list = current->bio_list[i];
+		bio_list_init(&current->bio_list[i]);
+
+		while ((bio = bio_list_pop(&list))) {
+			struct bio_set *bs = bio->bi_pool;
+			if (unlikely(!bs) || bs == fs_bio_set) {
+				bio_list_add(&current->bio_list[i], bio);
+				continue;
+			}
+
+			spin_lock(&bs->rescue_lock);
+			bio_list_add(&bs->rescue_list, bio);
+			queue_work(bs->rescue_workqueue, &bs->rescue_work);
+			spin_unlock(&bs->rescue_lock);
 		}
-
-		spin_lock(&bs->rescue_lock);
-		bio_list_add(&bs->rescue_list, bio);
-		queue_work(bs->rescue_workqueue, &bs->rescue_work);
-		spin_unlock(&bs->rescue_lock);
 	}
 }
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 063c43d83b72..0536658c9d40 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -974,7 +974,8 @@ static void wait_barrier(struct r10conf *conf)
 				    !conf->barrier ||
 				    (atomic_read(&conf->nr_pending) &&
 				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     (!bio_list_empty(&current->bio_list[0]) ||
+				      !bio_list_empty(&current->bio_list[1]))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 		if (!conf->nr_waiting)



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* [PATCH 2/5] blk: remove bio_set arg from blk_queue_split()
From: NeilBrown @ 2017-03-10  4:34 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87d1dphhuy.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 8745 bytes --]


blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c              |    2 +-
 block/blk-merge.c             |    7 +++----
 block/blk-mq.c                |    4 ++--
 drivers/block/drbd/drbd_req.c |    2 +-
 drivers/block/pktcdvd.c       |    2 +-
 drivers/block/ps3vram.c       |    2 +-
 drivers/block/rsxx/dev.c      |    2 +-
 drivers/block/umem.c          |    2 +-
 drivers/block/zram/zram_drv.c |    2 +-
 drivers/lightnvm/rrpc.c       |    2 +-
 drivers/md/md.c               |    2 +-
 drivers/s390/block/dcssblk.c  |    2 +-
 drivers/s390/block/xpram.c    |    2 +-
 include/linux/blkdev.h        |    3 +--
 14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bd2cb4ba674e..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,7 +1637,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 */
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..ce8838aff7f7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -188,8 +188,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio,
-		     struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
 	struct bio *split, *res;
 	unsigned nsegs;
@@ -197,14 +196,14 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+		split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
 		split = NULL;
 		nsegs = (*bio)->bi_phys_segments;
 		break;
 	case REQ_OP_WRITE_SAME:
-		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+		split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	default:
 		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2fd175e84d7..ef7b289e873c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1502,7 +1502,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
@@ -1624,7 +1624,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q)) {
 		if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114ae1a8a..f6ed6f7f5ab2 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1554,7 +1554,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 66d846ba85a9..98394d034c29 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	pd = q->queuedata;
 	if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..8c8024f616ae 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	might_sleep();
 
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index c141cc3be22b..c8d8a2f16f8e 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -529,7 +529,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e27d89a36c34..973dac2a9a99 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -880,7 +880,7 @@ static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 {
 	struct zram *zram = queue->queuedata;
 
-	blk_queue_split(queue, &bio, queue->bio_split);
+	blk_queue_split(queue, &bio);
 
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7b976f..701fa2fafbb2 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -999,7 +999,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	struct nvm_rq *rqd;
 	int err;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_op(bio) == REQ_OP_DISCARD) {
 		rrpc_discard(rrpc, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 548d1b8014f8..d7d2bb51a58d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -253,7 +253,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (mddev == NULL || mddev->pers == NULL) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..10ece6f3c7eb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -829,7 +829,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8a3..a48f0d40c1d2 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,7 +190,7 @@ static blk_qc_t xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 796016e63c1d..63008a246403 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -934,8 +934,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
-extern void blk_queue_split(struct request_queue *, struct bio **,
-			    struct bio_set *);
+extern void blk_queue_split(struct request_queue *, struct bio **);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* [PATCH 3/5] blk: make the bioset rescue_workqueue optional.
From: NeilBrown @ 2017-03-10  4:35 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87d1dphhuy.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 11485 bytes --]


This patch converts bioset_create() and bioset_create_nobvec()
to not create a workqueue so alloctions will never trigger
punt_bios_to_rescuer().
It also introduces bioset_create_rescued() and bioset_create_nobvec_rescued()
which preserve the old behaviour.

*All* callers of bioset_create() and bioset_create_nobvec() are
converted to the _rescued() version, so that not change in behaviour
is experienced.

It is hoped that most, if not all, bioset can end up being the
non-rescued version.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c                         |   30 +++++++++++++++++++++++++-----
 block/blk-core.c                    |    2 +-
 drivers/block/drbd/drbd_main.c      |    2 +-
 drivers/md/bcache/super.c           |    4 ++--
 drivers/md/dm-crypt.c               |    2 +-
 drivers/md/dm-io.c                  |    2 +-
 drivers/md/dm.c                     |    5 +++--
 drivers/md/md.c                     |    2 +-
 drivers/md/raid5-cache.c            |    2 +-
 drivers/target/target_core_iblock.c |    2 +-
 fs/block_dev.c                      |    2 +-
 fs/btrfs/extent_io.c                |    4 ++--
 fs/xfs/xfs_super.c                  |    2 +-
 include/linux/bio.h                 |    2 ++
 14 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 84ae39f06f81..06587f1119f5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -362,6 +362,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	struct bio_list punt, nopunt;
 	struct bio *bio;
 
+	if (!WARN_ON_ONCE(!bs->rescue_workqueue))
+		return;
 	/*
 	 * In order to guarantee forward progress we must punt only bios that
 	 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -471,7 +473,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 		if (current->bio_list &&
 		    (!bio_list_empty(&current->bio_list[0]) ||
-		     !bio_list_empty(&current->bio_list[1])))
+		     !bio_list_empty(&current->bio_list[1])) &&
+		    bs->rescue_workqueue)
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1940,7 +1943,8 @@ EXPORT_SYMBOL(bioset_free);
 
 static struct bio_set *__bioset_create(unsigned int pool_size,
 				       unsigned int front_pad,
-				       bool create_bvec_pool)
+				       bool create_bvec_pool,
+				       bool create_rescue_workqueue)
 {
 	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
 	struct bio_set *bs;
@@ -1971,6 +1975,9 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 			goto bad;
 	}
 
+	if (!create_rescue_workqueue)
+		return bs;
+
 	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
 	if (!bs->rescue_workqueue)
 		goto bad;
@@ -1996,10 +2003,16 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
  */
 struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 {
-	return __bioset_create(pool_size, front_pad, true);
+	return __bioset_create(pool_size, front_pad, true, false);
 }
 EXPORT_SYMBOL(bioset_create);
 
+struct bio_set *bioset_create_rescued(unsigned int pool_size, unsigned int front_pad)
+{
+	return __bioset_create(pool_size, front_pad, true, true);
+}
+EXPORT_SYMBOL(bioset_create_rescued);
+
 /**
  * bioset_create_nobvec  - Create a bio_set without bio_vec mempool
  * @pool_size:	Number of bio to cache in the mempool
@@ -2011,10 +2024,17 @@ EXPORT_SYMBOL(bioset_create);
  */
 struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_pad)
 {
-	return __bioset_create(pool_size, front_pad, false);
+	return __bioset_create(pool_size, front_pad, false, false);
 }
 EXPORT_SYMBOL(bioset_create_nobvec);
 
+struct bio_set *bioset_create_nobvec_rescued(unsigned int pool_size,
+					     unsigned int front_pad)
+{
+	return __bioset_create(pool_size, front_pad, false, true);
+}
+EXPORT_SYMBOL(bioset_create_nobvec_rescued);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
@@ -2129,7 +2149,7 @@ static int __init init_bio(void)
 	bio_integrity_init();
 	biovec_init_slabs();
 
-	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+	fs_bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 375006c94c15..c3992d17dc2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
 	if (!q->bio_split)
 		goto fail_id;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..2c69c2ab0fff 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2166,7 +2166,7 @@ static int drbd_create_mempools(void)
 		goto Enomem;
 
 	/* mempools */
-	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
+	drbd_md_io_bio_set = bioset_create_rescued(DRBD_MIN_POOL_PAGES, 0);
 	if (drbd_md_io_bio_set == NULL)
 		goto Enomem;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21c2514..6cb30792f0ed 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 
 	minor *= BCACHE_MINORS;
 
-	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	if (!(d->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
 	    !(d->disk = alloc_disk(BCACHE_MINORS))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1520,7 +1520,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 				sizeof(struct bbio) + sizeof(struct bio_vec) *
 				bucket_pages(c))) ||
 	    !(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
-	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	    !(c->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
 	    !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 	    !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 						WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a3637ffcc..91a2d637d44f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1936,7 +1936,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
-	cc->bs = bioset_create(MIN_IOS, 0);
+	cc->bs = bioset_create_rescued(MIN_IOS, 0);
 	if (!cc->bs) {
 		ti->error = "Cannot allocate crypt bioset";
 		goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..fe1241c196b1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
 	if (!client->pool)
 		goto bad;
 
-	client->bios = bioset_create(min_ios, 0);
+	client->bios = bioset_create_rescued(min_ios, 0);
 	if (!client->bios)
 		goto bad;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..41b1f033841f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1002,7 +1002,8 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 
 		while ((bio = bio_list_pop(&list))) {
 			struct bio_set *bs = bio->bi_pool;
-			if (unlikely(!bs) || bs == fs_bio_set) {
+			if (unlikely(!bs) || bs == fs_bio_set ||
+			    !bs->rescue_workqueue) {
 				bio_list_add(&current->bio_list[i], bio);
 				continue;
 			}
@@ -2577,7 +2578,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
 		BUG();
 	}
 
-	pools->bs = bioset_create_nobvec(pool_size, front_pad);
+	pools->bs = bioset_create_nobvec_rescued(pool_size, front_pad);
 	if (!pools->bs)
 		goto out;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d7d2bb51a58d..e5f08a195837 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5220,7 +5220,7 @@ int md_run(struct mddev *mddev)
 	}
 
 	if (mddev->bio_set == NULL) {
-		mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+		mddev->bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
 		if (!mddev->bio_set)
 			return -ENOMEM;
 	}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..c95c6c046395 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2831,7 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->io_pool)
 		goto io_pool;
 
-	log->bs = bioset_create(R5L_POOL_SIZE, 0);
+	log->bs = bioset_create_rescued(R5L_POOL_SIZE, 0);
 	if (!log->bs)
 		goto io_bs;
 
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d316ed537d59..5bf3392195c6 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -93,7 +93,7 @@ static int iblock_configure_device(struct se_device *dev)
 		return -EINVAL;
 	}
 
-	ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0);
+	ib_dev->ibd_bio_set = bioset_create_rescued(IBLOCK_BIO_POOL_SIZE, 0);
 	if (!ib_dev->ibd_bio_set) {
 		pr_err("IBLOCK: Unable to create bioset\n");
 		goto out;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..c0ca5f0d0369 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..34aa8893790a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -173,8 +173,8 @@ int __init extent_io_init(void)
 	if (!extent_buffer_cache)
 		goto free_state_cache;
 
-	btrfs_bioset = bioset_create(BIO_POOL_SIZE,
-				     offsetof(struct btrfs_io_bio, bio));
+	btrfs_bioset = bioset_create_rescued(BIO_POOL_SIZE,
+					     offsetof(struct btrfs_io_bio, bio));
 	if (!btrfs_bioset)
 		goto free_buffer_cache;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 890862f2447c..f4c4d6f41d91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1756,7 +1756,7 @@ MODULE_ALIAS_FS("xfs");
 STATIC int __init
 xfs_init_zones(void)
 {
-	xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
+	xfs_ioend_bioset = bioset_create_rescued(4 * MAX_BUF_PER_PAGE,
 			offsetof(struct xfs_ioend, io_inline_bio));
 	if (!xfs_ioend_bioset)
 		goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..05730603fcf1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -379,7 +379,9 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 }
 
 extern struct bio_set *bioset_create(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_rescued(unsigned int, unsigned int);
 extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_nobvec_rescued(unsigned int, unsigned int);
 extern void bioset_free(struct bio_set *);
 extern mempool_t *biovec_create_pool(int pool_entries);
 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split.
From: NeilBrown @ 2017-03-10  4:36 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87d1dphhuy.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 1967 bytes --]


A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called.  This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s.  The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possible after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled.  None of these early bios
can possible trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly.  By this time there
cannot be and other bios allocated fro q->bio_split in the
generic_make_request() queue.  So no rescuing will ever be
needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c3992d17dc2c..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
 	if (!q->bio_split)
 		goto fail_id;
 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset
From: NeilBrown @ 2017-03-10  4:37 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87d1dphhuy.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]


Allocations from blkdev_dio_pool are never made under
generic_make_request, so this bioset does not need a rescuer thread.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/block_dev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c0ca5f0d0369..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Jens Axboe @ 2017-03-10  4:38 UTC (permalink / raw)
  To: NeilBrown, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <87d1dphhuy.fsf@notabene.neil.brown.name>

On 03/09/2017 09:32 PM, NeilBrown wrote:
> 
> I started looking further at the improvements we can make once
> generic_make_request is fixed, and realised that I had missed an
> important detail in this patch.
> Several places test current->bio_list, and two actually edit the list.
> With this change, that cannot see the whole lists, so it could cause a
> regression.
> 
> So I've revised the patch to make sure that the entire list of queued
> bios remains visible, and change the relevant code to look at both
> the new list and the old list.
> 
> Following that there are some patches which make the rescuer thread
> optional, and then starts removing it from some easy-to-fix places.

Neil, note that the v2 patch is already in Linus tree as of earlier
today. You need to rebase the series, and if we need fixups on
top of v2, then that should be done separately and with increased
urgency.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Jens Axboe @ 2017-03-10  4:40 UTC (permalink / raw)
  To: NeilBrown, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <58be6551-4aa7-72ee-1616-a1545606d029@kernel.dk>

On 03/09/2017 09:38 PM, Jens Axboe wrote:
> On 03/09/2017 09:32 PM, NeilBrown wrote:
>>
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that cannot see the whole lists, so it could cause a
>> regression.
>>
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>>
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
> 
> Neil, note that the v2 patch is already in Linus tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

Additionally, at least the first patch appears to be badly mangled.
The formatting is screwed up.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] md: fix super_offset endianness in super_1_rdev_size_change
From: Shaohua Li @ 2017-03-10  4:47 UTC (permalink / raw)
  To: Jason Yan; +Cc: linux-raid, miaoxie, zhaohongjiang
In-Reply-To: <1489116443-16195-1-git-send-email-yanaijie@huawei.com>

On Fri, Mar 10, 2017 at 11:27:23AM +0800, Jason Yan wrote:
> The sb->super_offset should be big-endian, but the rdev->sb_start is in
> host byte order, so fix this by adding cpu_to_le64.
> 
> Signed-off-by: Jason Yan <yanaijie@huawei.com>

thanks, applied the two patches.

> ---
>  drivers/md/md.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 548d1b8..b6516f8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1887,7 +1887,7 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
>  	}
>  	sb = page_address(rdev->sb_page);
>  	sb->data_size = cpu_to_le64(num_sectors);
> -	sb->super_offset = rdev->sb_start;
> +	sb->super_offset = cpu_to_le64(rdev->sb_start);
>  	sb->sb_csum = calc_sb_1_csum(sb);
>  	do {
>  		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
> -- 
> 2.5.0
> 

^ permalink raw reply

* Re: [PATCH v2] md/r5cache: generate R5LOG_PAYLOAD_FLUSH
From: Shaohua Li @ 2017-03-10  5:03 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch
In-Reply-To: <20170310002403.1951485-1-songliubraving@fb.com>

On Thu, Mar 09, 2017 at 04:24:03PM -0800, Song Liu wrote:
> In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to
> log->current_io.
> 
> Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to
> journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
> quiesce.
> 
> Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
> However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH,
> which is simpler.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index e69f922..f148c68 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -588,10 +588,11 @@ static void r5l_log_endio(struct bio *bio)
>  
>  	bio_put(bio);
>  	mempool_free(io->meta_page, log->meta_pool);
> +	__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
>  
>  	spin_lock_irqsave(&log->io_list_lock, flags);
> -	__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
why move this out the lock?

> -	if (log->need_cache_flush)
> +
> +	if (log->need_cache_flush && !list_empty(&io->stripe_list))
>  		r5l_move_to_end_ios(log);
>  	else
>  		r5l_log_run_stripes(log);
> @@ -619,9 +620,11 @@ static void r5l_log_endio(struct bio *bio)
>  			bio_endio(bi);
>  			atomic_dec(&io->pending_stripe);
>  		}
> -		if (atomic_read(&io->pending_stripe) == 0)
> -			__r5l_stripe_write_finished(io);
>  	}
> +
> +	/* finish flush only io_unit and PAYLOAD_FLUSH only io_unit */
> +	if (list_empty(&io->stripe_list))

we don't hold lock here, so can't check the list. checking the pending_stripe
should work too and is safe.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: NeilBrown @ 2017-03-10  5:19 UTC (permalink / raw)
  To: Jens Axboe, Jack Wang
  Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer,
	Mikulas Patocka, linux-raid, device-mapper development,
	linux-block
In-Reply-To: <58be6551-4aa7-72ee-1616-a1545606d029@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 1137 bytes --]

On Thu, Mar 09 2017, Jens Axboe wrote:

> On 03/09/2017 09:32 PM, NeilBrown wrote:
>> 
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that cannot see the whole lists, so it could cause a
>> regression.
>> 
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>> 
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
>
> Neil, note that the v2 patch is already in Linus tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

I had checked linux-next, but not the latest from Linus.
I see it now - thanks!
I'll rebase (and ensure nothing gets mangled)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH v3] md/r5cache: generate R5LOG_PAYLOAD_FLUSH
From: Song Liu @ 2017-03-10  5:23 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu

In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to
log->current_io.

Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to
journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
quiesce.

Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH,
which is simpler.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index e69f922..e3c20c7 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -591,7 +591,7 @@ static void r5l_log_endio(struct bio *bio)
 
 	spin_lock_irqsave(&log->io_list_lock, flags);
 	__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
-	if (log->need_cache_flush)
+	if (log->need_cache_flush && !list_empty(&io->stripe_list))
 		r5l_move_to_end_ios(log);
 	else
 		r5l_log_run_stripes(log);
@@ -619,9 +619,11 @@ static void r5l_log_endio(struct bio *bio)
 			bio_endio(bi);
 			atomic_dec(&io->pending_stripe);
 		}
-		if (atomic_read(&io->pending_stripe) == 0)
-			__r5l_stripe_write_finished(io);
 	}
+
+	/* finish flush only io_unit and PAYLOAD_FLUSH only io_unit */
+	if (atomic_read(&io->pending_stripe) == 0)
+		__r5l_stripe_write_finished(io);
 }
 
 static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
@@ -843,6 +845,41 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
 	r5_reserve_log_entry(log, io);
 }
 
+static void r5l_append_flush_payload(struct r5l_log *log, sector_t sect)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct r5l_io_unit *io;
+	struct r5l_payload_flush *payload;
+	int meta_size;
+
+	/*
+	 * payload_flush requires extra writes to the journal.
+	 * To avoid handling the extra IO in quiesce, just skip
+	 * flush_payload
+	 */
+	if (conf->quiesce)
+		return;
+
+	mutex_lock(&log->io_mutex);
+	meta_size = sizeof(struct r5l_payload_flush) + sizeof(__le64);
+
+	if (r5l_get_meta(log, meta_size)) {
+		mutex_unlock(&log->io_mutex);
+		return;
+	}
+
+	/* current implementation is one stripe per flush payload */
+	io = log->current_io;
+	payload = page_address(io->meta_page) + io->meta_offset;
+	payload->header.type = cpu_to_le16(R5LOG_PAYLOAD_FLUSH);
+	payload->header.flags = cpu_to_le16(0);
+	payload->size = cpu_to_le32(sizeof(__le64));
+	payload->flush_stripes[0] = cpu_to_le64(sect);
+	io->meta_offset += meta_size;
+	mutex_unlock(&log->io_mutex);
+}
+
 static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 			   int data_pages, int parity_pages)
 {
@@ -2784,6 +2821,8 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 		atomic_dec(&conf->r5c_flushing_full_stripes);
 		atomic_dec(&conf->r5c_cached_full_stripes);
 	}
+
+	r5l_append_flush_payload(log, sh->sector);
 }
 
 int
-- 
2.9.3


^ permalink raw reply related

* [PATCH 0/5] Updates following recent generic_make_request improvement
From: NeilBrown @ 2017-03-10  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet

This is a rebase of the series I sent earlier, based on the
very latest from Linus, which included my first patch.

The first fixes a problem that patch introduced, and so should go to
Linux promptly.
The others are more general improvements and can go in the normal
course of events.

It is possible that the changes to btrfs and xfs can just be dropped
as a subsequent patch will be needed to revert them anyway.  They are
there only to be able to say that "blk: make the bioset
rescue_workqueue optional." doesn't change any functionality at all.

Thanks,
NeilBrown

---

NeilBrown (5):
      blk: Ensure users for current->bio_list can see the full list.
      blk: remove bio_set arg from blk_queue_split()
      blk: make the bioset rescue_workqueue optional.
      blk: use non-rescuing bioset for q->bio_split.
      block_dev: make blkdev_dio_pool a non-rescuing bioset


 block/bio.c                         |   40 +++++++++++++++++++++++++++++------
 block/blk-core.c                    |   32 +++++++++++++++++-----------
 block/blk-merge.c                   |    7 +++---
 block/blk-mq.c                      |    4 ++--
 drivers/block/drbd/drbd_main.c      |    2 +-
 drivers/block/drbd/drbd_req.c       |    2 +-
 drivers/block/pktcdvd.c             |    2 +-
 drivers/block/ps3vram.c             |    2 +-
 drivers/block/rsxx/dev.c            |    2 +-
 drivers/block/umem.c                |    2 +-
 drivers/block/zram/zram_drv.c       |    2 +-
 drivers/lightnvm/rrpc.c             |    2 +-
 drivers/md/bcache/super.c           |    4 ++--
 drivers/md/dm-crypt.c               |    2 +-
 drivers/md/dm-io.c                  |    2 +-
 drivers/md/dm.c                     |   32 ++++++++++++++++------------
 drivers/md/md.c                     |    4 ++--
 drivers/md/raid10.c                 |    3 ++-
 drivers/md/raid5-cache.c            |    2 +-
 drivers/s390/block/dcssblk.c        |    2 +-
 drivers/s390/block/xpram.c          |    2 +-
 drivers/target/target_core_iblock.c |    2 +-
 fs/btrfs/extent_io.c                |    4 ++--
 fs/xfs/xfs_super.c                  |    2 +-
 include/linux/bio.h                 |    2 ++
 include/linux/blkdev.h              |    3 +--
 26 files changed, 101 insertions(+), 64 deletions(-)

--
Signature

^ permalink raw reply

* [PATCH 1/5] blk: Ensure users for current->bio_list can see the full list.
From: NeilBrown @ 2017-03-10  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.

There are two places which walk the list and requeue selected bios,
and others that check if the list is empty.  These are no longer
correct.

So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c         |   12 +++++++++---
 block/blk-core.c    |   30 ++++++++++++++++++------------
 drivers/md/dm.c     |   29 ++++++++++++++++-------------
 drivers/md/raid10.c |    3 ++-
 4 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..e75878f8b14a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -376,10 +376,14 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	bio_list_init(&punt);
 	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	while ((bio = bio_list_pop(&current->bio_list[0])))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[0] = nopunt;
 
-	*current->bio_list = nopunt;
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_list[1])))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_list[1] = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -466,7 +470,9 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current->bio_list &&
+		    (!bio_list_empty(&current->bio_list[0]) ||
+		     !bio_list_empty(&current->bio_list[1])))
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..d772c221cc17 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1973,7 +1973,14 @@ generic_make_request_checks(struct bio *bio)
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	/*
+	 * bio_list_on_stack[0] contains bios submitted by the current
+	 * make_request_fn.
+	 * bio_list_on_stack[1] contains bios that were submitted before
+	 * the current make_request_fn, but that haven't been processed
+	 * yet.
+	 */
+	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -1990,7 +1997,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		bio_list_add(&current->bio_list[0], bio);
 		goto out;
 	}
 
@@ -2009,18 +2016,17 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_list_on_stack[0]);
+	current->bio_list = bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
-			struct bio_list hold;
 			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
-			hold = bio_list_on_stack;
-			bio_list_init(&bio_list_on_stack);
+			bio_list_on_stack[1] = bio_list_on_stack[0];
+			bio_list_init(&bio_list_on_stack[0]);
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
@@ -2030,19 +2036,19 @@ blk_qc_t generic_make_request(struct bio *bio)
 			 */
 			bio_list_init(&lower);
 			bio_list_init(&same);
-			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
 				if (q == bdev_get_queue(bio->bi_bdev))
 					bio_list_add(&same, bio);
 				else
 					bio_list_add(&lower, bio);
 			/* now assemble so we handle the lowest level first */
-			bio_list_merge(&bio_list_on_stack, &lower);
-			bio_list_merge(&bio_list_on_stack, &same);
-			bio_list_merge(&bio_list_on_stack, &hold);
+			bio_list_merge(&bio_list_on_stack[0], &lower);
+			bio_list_merge(&bio_list_on_stack[0], &same);
+			bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
 		} else {
 			bio_io_error(bio);
 		}
-		bio = bio_list_pop(current->bio_list);
+		bio = bio_list_pop(&bio_list_on_stack[0]);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..dfb75979e455 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -989,26 +989,29 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 	struct dm_offload *o = container_of(cb, struct dm_offload, cb);
 	struct bio_list list;
 	struct bio *bio;
+	int i;
 
 	INIT_LIST_HEAD(&o->cb.list);
 
 	if (unlikely(!current->bio_list))
 		return;
 
-	list = *current->bio_list;
-	bio_list_init(current->bio_list);
-
-	while ((bio = bio_list_pop(&list))) {
-		struct bio_set *bs = bio->bi_pool;
-		if (unlikely(!bs) || bs == fs_bio_set) {
-			bio_list_add(current->bio_list, bio);
-			continue;
+	for (i = 0; i < 2; i++) {
+		list = current->bio_list[i];
+		bio_list_init(&current->bio_list[i]);
+
+		while ((bio = bio_list_pop(&list))) {
+			struct bio_set *bs = bio->bi_pool;
+			if (unlikely(!bs) || bs == fs_bio_set) {
+				bio_list_add(&current->bio_list[i], bio);
+				continue;
+			}
+
+			spin_lock(&bs->rescue_lock);
+			bio_list_add(&bs->rescue_list, bio);
+			queue_work(bs->rescue_workqueue, &bs->rescue_work);
+			spin_unlock(&bs->rescue_lock);
 		}
-
-		spin_lock(&bs->rescue_lock);
-		bio_list_add(&bs->rescue_list, bio);
-		queue_work(bs->rescue_workqueue, &bs->rescue_work);
-		spin_unlock(&bs->rescue_lock);
 	}
 }
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 063c43d83b72..0536658c9d40 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -974,7 +974,8 @@ static void wait_barrier(struct r10conf *conf)
 				    !conf->barrier ||
 				    (atomic_read(&conf->nr_pending) &&
 				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     (!bio_list_empty(&current->bio_list[0]) ||
+				      !bio_list_empty(&current->bio_list[1]))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 		if (!conf->nr_waiting)

^ permalink raw reply related

* [PATCH 2/5] blk: remove bio_set arg from blk_queue_split()
From: NeilBrown @ 2017-03-10  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c              |    2 +-
 block/blk-merge.c             |    7 +++----
 block/blk-mq.c                |    4 ++--
 drivers/block/drbd/drbd_req.c |    2 +-
 drivers/block/pktcdvd.c       |    2 +-
 drivers/block/ps3vram.c       |    2 +-
 drivers/block/rsxx/dev.c      |    2 +-
 drivers/block/umem.c          |    2 +-
 drivers/block/zram/zram_drv.c |    2 +-
 drivers/lightnvm/rrpc.c       |    2 +-
 drivers/md/md.c               |    2 +-
 drivers/s390/block/dcssblk.c  |    2 +-
 drivers/s390/block/xpram.c    |    2 +-
 include/linux/blkdev.h        |    3 +--
 14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..fae7966e1f98 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1635,7 +1635,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 */
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..ce8838aff7f7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -188,8 +188,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio,
-		     struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
 	struct bio *split, *res;
 	unsigned nsegs;
@@ -197,14 +196,14 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+		split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
 		split = NULL;
 		nsegs = (*bio)->bi_phys_segments;
 		break;
 	case REQ_OP_WRITE_SAME:
-		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+		split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	default:
 		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..e582d7f7511e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1502,7 +1502,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
@@ -1624,7 +1624,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q)) {
 		if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114ae1a8a..f6ed6f7f5ab2 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1554,7 +1554,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 66d846ba85a9..98394d034c29 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	pd = q->queuedata;
 	if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..8c8024f616ae 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	might_sleep();
 
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index c141cc3be22b..c8d8a2f16f8e 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -529,7 +529,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dceb5edd1e54..49d8bf37c4ef 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -880,7 +880,7 @@ static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 {
 	struct zram *zram = queue->queuedata;
 
-	blk_queue_split(queue, &bio, queue->bio_split);
+	blk_queue_split(queue, &bio);
 
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 					bio->bi_iter.bi_size)) {
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7b976f..701fa2fafbb2 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -999,7 +999,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	struct nvm_rq *rqd;
 	int err;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_op(bio) == REQ_OP_DISCARD) {
 		rrpc_discard(rrpc, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 548d1b8014f8..d7d2bb51a58d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -253,7 +253,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (mddev == NULL || mddev->pers == NULL) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..10ece6f3c7eb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -829,7 +829,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8a3..a48f0d40c1d2 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,7 +190,7 @@ static blk_qc_t xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..8f98b4e2b9e6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -933,8 +933,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
-extern void blk_queue_split(struct request_queue *, struct bio **,
-			    struct bio_set *);
+extern void blk_queue_split(struct request_queue *, struct bio **);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,

^ permalink raw reply related

* [PATCH 3/5] blk: make the bioset rescue_workqueue optional.
From: NeilBrown @ 2017-03-10  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

This patch converts bioset_create() and bioset_create_nobvec()
to not create a workqueue so alloctions will never trigger
punt_bios_to_rescuer().
It also introduces bioset_create_rescued() and bioset_create_nobvec_rescued()
which preserve the old behaviour.

*All* callers of bioset_create() and bioset_create_nobvec() are
converted to the _rescued() version, so that no change in behaviour
is experienced.

It is hoped that most, if not all, biosets can end up being the
non-rescued version.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c                         |   30 +++++++++++++++++++++++++-----
 block/blk-core.c                    |    2 +-
 drivers/block/drbd/drbd_main.c      |    2 +-
 drivers/md/bcache/super.c           |    4 ++--
 drivers/md/dm-crypt.c               |    2 +-
 drivers/md/dm-io.c                  |    2 +-
 drivers/md/dm.c                     |    5 +++--
 drivers/md/md.c                     |    2 +-
 drivers/md/raid5-cache.c            |    2 +-
 drivers/target/target_core_iblock.c |    2 +-
 fs/block_dev.c                      |    2 +-
 fs/btrfs/extent_io.c                |    4 ++--
 fs/xfs/xfs_super.c                  |    2 +-
 include/linux/bio.h                 |    2 ++
 14 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e75878f8b14a..3790c3f376b6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -362,6 +362,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	struct bio_list punt, nopunt;
 	struct bio *bio;
 
+	if (!WARN_ON_ONCE(!bs->rescue_workqueue))
+		return;
 	/*
 	 * In order to guarantee forward progress we must punt only bios that
 	 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -472,7 +474,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 		if (current->bio_list &&
 		    (!bio_list_empty(&current->bio_list[0]) ||
-		     !bio_list_empty(&current->bio_list[1])))
+		     !bio_list_empty(&current->bio_list[1])) &&
+		    bs->rescue_workqueue)
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1941,7 +1944,8 @@ EXPORT_SYMBOL(bioset_free);
 
 static struct bio_set *__bioset_create(unsigned int pool_size,
 				       unsigned int front_pad,
-				       bool create_bvec_pool)
+				       bool create_bvec_pool,
+				       bool create_rescue_workqueue)
 {
 	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
 	struct bio_set *bs;
@@ -1972,6 +1976,9 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 			goto bad;
 	}
 
+	if (!create_rescue_workqueue)
+		return bs;
+
 	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
 	if (!bs->rescue_workqueue)
 		goto bad;
@@ -1997,10 +2004,16 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
  */
 struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 {
-	return __bioset_create(pool_size, front_pad, true);
+	return __bioset_create(pool_size, front_pad, true, false);
 }
 EXPORT_SYMBOL(bioset_create);
 
+struct bio_set *bioset_create_rescued(unsigned int pool_size, unsigned int front_pad)
+{
+	return __bioset_create(pool_size, front_pad, true, true);
+}
+EXPORT_SYMBOL(bioset_create_rescued);
+
 /**
  * bioset_create_nobvec  - Create a bio_set without bio_vec mempool
  * @pool_size:	Number of bio to cache in the mempool
@@ -2012,10 +2025,17 @@ EXPORT_SYMBOL(bioset_create);
  */
 struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_pad)
 {
-	return __bioset_create(pool_size, front_pad, false);
+	return __bioset_create(pool_size, front_pad, false, false);
 }
 EXPORT_SYMBOL(bioset_create_nobvec);
 
+struct bio_set *bioset_create_nobvec_rescued(unsigned int pool_size,
+					     unsigned int front_pad)
+{
+	return __bioset_create(pool_size, front_pad, false, true);
+}
+EXPORT_SYMBOL(bioset_create_nobvec_rescued);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
@@ -2130,7 +2150,7 @@ static int __init init_bio(void)
 	bio_integrity_init();
 	biovec_init_slabs();
 
-	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+	fs_bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
 
diff --git a/block/blk-core.c b/block/blk-core.c
index fae7966e1f98..430c82f646eb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -712,7 +712,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
 	if (!q->bio_split)
 		goto fail_id;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..2c69c2ab0fff 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2166,7 +2166,7 @@ static int drbd_create_mempools(void)
 		goto Enomem;
 
 	/* mempools */
-	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
+	drbd_md_io_bio_set = bioset_create_rescued(DRBD_MIN_POOL_PAGES, 0);
 	if (drbd_md_io_bio_set == NULL)
 		goto Enomem;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21c2514..6cb30792f0ed 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 
 	minor *= BCACHE_MINORS;
 
-	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	if (!(d->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
 	    !(d->disk = alloc_disk(BCACHE_MINORS))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1520,7 +1520,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 				sizeof(struct bbio) + sizeof(struct bio_vec) *
 				bucket_pages(c))) ||
 	    !(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
-	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	    !(c->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
 	    !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 	    !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 						WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a3637ffcc..91a2d637d44f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1936,7 +1936,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
-	cc->bs = bioset_create(MIN_IOS, 0);
+	cc->bs = bioset_create_rescued(MIN_IOS, 0);
 	if (!cc->bs) {
 		ti->error = "Cannot allocate crypt bioset";
 		goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..fe1241c196b1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
 	if (!client->pool)
 		goto bad;
 
-	client->bios = bioset_create(min_ios, 0);
+	client->bios = bioset_create_rescued(min_ios, 0);
 	if (!client->bios)
 		goto bad;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..41b1f033841f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1002,7 +1002,8 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 
 		while ((bio = bio_list_pop(&list))) {
 			struct bio_set *bs = bio->bi_pool;
-			if (unlikely(!bs) || bs == fs_bio_set) {
+			if (unlikely(!bs) || bs == fs_bio_set ||
+			    !bs->rescue_workqueue) {
 				bio_list_add(&current->bio_list[i], bio);
 				continue;
 			}
@@ -2577,7 +2578,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
 		BUG();
 	}
 
-	pools->bs = bioset_create_nobvec(pool_size, front_pad);
+	pools->bs = bioset_create_nobvec_rescued(pool_size, front_pad);
 	if (!pools->bs)
 		goto out;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d7d2bb51a58d..e5f08a195837 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5220,7 +5220,7 @@ int md_run(struct mddev *mddev)
 	}
 
 	if (mddev->bio_set == NULL) {
-		mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+		mddev->bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
 		if (!mddev->bio_set)
 			return -ENOMEM;
 	}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..c95c6c046395 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2831,7 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->io_pool)
 		goto io_pool;
 
-	log->bs = bioset_create(R5L_POOL_SIZE, 0);
+	log->bs = bioset_create_rescued(R5L_POOL_SIZE, 0);
 	if (!log->bs)
 		goto io_bs;
 
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d316ed537d59..5bf3392195c6 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -93,7 +93,7 @@ static int iblock_configure_device(struct se_device *dev)
 		return -EINVAL;
 	}
 
-	ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0);
+	ib_dev->ibd_bio_set = bioset_create_rescued(IBLOCK_BIO_POOL_SIZE, 0);
 	if (!ib_dev->ibd_bio_set) {
 		pr_err("IBLOCK: Unable to create bioset\n");
 		goto out;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..c0ca5f0d0369 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..34aa8893790a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -173,8 +173,8 @@ int __init extent_io_init(void)
 	if (!extent_buffer_cache)
 		goto free_state_cache;
 
-	btrfs_bioset = bioset_create(BIO_POOL_SIZE,
-				     offsetof(struct btrfs_io_bio, bio));
+	btrfs_bioset = bioset_create_rescued(BIO_POOL_SIZE,
+					     offsetof(struct btrfs_io_bio, bio));
 	if (!btrfs_bioset)
 		goto free_buffer_cache;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 890862f2447c..f4c4d6f41d91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1756,7 +1756,7 @@ MODULE_ALIAS_FS("xfs");
 STATIC int __init
 xfs_init_zones(void)
 {
-	xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
+	xfs_ioend_bioset = bioset_create_rescued(4 * MAX_BUF_PER_PAGE,
 			offsetof(struct xfs_ioend, io_inline_bio));
 	if (!xfs_ioend_bioset)
 		goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..05730603fcf1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -379,7 +379,9 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 }
 
 extern struct bio_set *bioset_create(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_rescued(unsigned int, unsigned int);
 extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_nobvec_rescued(unsigned int, unsigned int);
 extern void bioset_free(struct bio_set *);
 extern mempool_t *biovec_create_pool(int pool_entries);
 

^ permalink raw reply related

* [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset
From: NeilBrown @ 2017-03-10  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

Allocations from blkdev_dio_pool are never made under
generic_make_request, so this bioset does not need a rescuer thread.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/block_dev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c0ca5f0d0369..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;

^ permalink raw reply related

* [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split.
From: NeilBrown @ 2017-03-10  6:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, LKML, linux-block,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called.  This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s.  The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possible after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled.  None of these early bios
can possible trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly.  By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue.  So no rescuing will ever be
needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 430c82f646eb..fae7966e1f98 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -712,7 +712,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
 	if (!q->bio_split)
 		goto fail_id;
 

^ permalink raw reply related

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Lars Ellenberg @ 2017-03-10 12:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Jack Wang, LKML, Kent Overstreet, Pavel Machek,
	Mike Snitzer, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block
In-Reply-To: <87varhg14d.fsf@notabene.neil.brown.name>

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -       struct bio_list bio_list_on_stack;
> +       /*
> +        * bio_list_on_stack[0] contains bios submitted by the current
> +        * make_request_fn.
> +        * bio_list_on_stack[1] contains bios that were submitted before
> +        * the current make_request_fn, but that haven't been processed
> +        * yet.
> +        */
> +       struct bio_list bio_list_on_stack[2];
>         blk_qc_t ret = BLK_QC_T_NONE;

May I suggest that, if you intend to assign something that is not a
plain &(struct bio_list), but a &(struct bio_list[2]),
you change the task member so it is renamed (current->bio_list vs
current->bio_lists, plural, is what I did last year).
Or you will break external modules, silently, and horribly (or,
rather, they won't notice, but break the kernel).
Examples of such modules would be DRBD, ZFS, quite possibly others.

Thanks,

    Lars

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Mike Snitzer @ 2017-03-10 14:38 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: NeilBrown, Jens Axboe, Jack Wang, LKML, Kent Overstreet,
	Pavel Machek, Mikulas Patocka, linux-raid,
	device-mapper development, linux-block
In-Reply-To: <CANr6vz8EbRWXq7igGCzRy9JC1Nt=MMma0h8M6nxHQtwiMDa5aQ@mail.gmail.com>

On Fri, Mar 10 2017 at  7:34am -0500,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >   */
> >  blk_qc_t generic_make_request(struct bio *bio)
> >  {
> > -       struct bio_list bio_list_on_stack;
> > +       /*
> > +        * bio_list_on_stack[0] contains bios submitted by the current
> > +        * make_request_fn.
> > +        * bio_list_on_stack[1] contains bios that were submitted before
> > +        * the current make_request_fn, but that haven't been processed
> > +        * yet.
> > +        */
> > +       struct bio_list bio_list_on_stack[2];
> >         blk_qc_t ret = BLK_QC_T_NONE;
> 
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.

drbd is upstream -- so what is the problem?  (if you are having to
distribute drbd independent of the upstream drbd then why is drbd
upstream?)

As for ZFS, worrying about ZFS kABI breakage is the last thing we should
be doing.

So Nack from me on this defensive make-work for external modules.

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Mikulas Patocka @ 2017-03-10 14:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Lars Ellenberg, NeilBrown, Jens Axboe, Jack Wang, LKML,
	Kent Overstreet, Pavel Machek, linux-raid,
	device-mapper development, linux-block
In-Reply-To: <20170310143822.GA23879@redhat.com>



On Fri, 10 Mar 2017, Mike Snitzer wrote:

> On Fri, Mar 10 2017 at  7:34am -0500,
> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> 
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> > >   */
> > >  blk_qc_t generic_make_request(struct bio *bio)
> > >  {
> > > -       struct bio_list bio_list_on_stack;
> > > +       /*
> > > +        * bio_list_on_stack[0] contains bios submitted by the current
> > > +        * make_request_fn.
> > > +        * bio_list_on_stack[1] contains bios that were submitted before
> > > +        * the current make_request_fn, but that haven't been processed
> > > +        * yet.
> > > +        */
> > > +       struct bio_list bio_list_on_stack[2];
> > >         blk_qc_t ret = BLK_QC_T_NONE;
> > 
> > May I suggest that, if you intend to assign something that is not a
> > plain &(struct bio_list), but a &(struct bio_list[2]),
> > you change the task member so it is renamed (current->bio_list vs
> > current->bio_lists, plural, is what I did last year).
> > Or you will break external modules, silently, and horribly (or,
> > rather, they won't notice, but break the kernel).
> > Examples of such modules would be DRBD, ZFS, quite possibly others.
> 
> drbd is upstream -- so what is the problem?  (if you are having to
> distribute drbd independent of the upstream drbd then why is drbd
> upstream?)
> 
> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> be doing.

It's better to make external modules not compile than to silently 
introduce bugs in them. So yes, I would rename that.

Mikulas

> So Nack from me on this defensive make-work for external modules.
> 

^ permalink raw reply

* Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()
From: Jack Wang @ 2017-03-10 15:07 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Lars Ellenberg, NeilBrown, Jens Axboe, LKML, Kent Overstreet,
	Pavel Machek, linux-raid, device-mapper development, linux-block
In-Reply-To: <alpine.LRH.2.02.1703100953280.9054@file01.intranet.prod.int.rdu2.redhat.com>



On 10.03.2017 15:55, Mikulas Patocka wrote:
> 
> 
> On Fri, 10 Mar 2017, Mike Snitzer wrote:
> 
>> On Fri, Mar 10 2017 at  7:34am -0500,
>> Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
>>
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>>>>   */
>>>>  blk_qc_t generic_make_request(struct bio *bio)
>>>>  {
>>>> -       struct bio_list bio_list_on_stack;
>>>> +       /*
>>>> +        * bio_list_on_stack[0] contains bios submitted by the current
>>>> +        * make_request_fn.
>>>> +        * bio_list_on_stack[1] contains bios that were submitted before
>>>> +        * the current make_request_fn, but that haven't been processed
>>>> +        * yet.
>>>> +        */
>>>> +       struct bio_list bio_list_on_stack[2];
>>>>         blk_qc_t ret = BLK_QC_T_NONE;
>>>
>>> May I suggest that, if you intend to assign something that is not a
>>> plain &(struct bio_list), but a &(struct bio_list[2]),
>>> you change the task member so it is renamed (current->bio_list vs
>>> current->bio_lists, plural, is what I did last year).
>>> Or you will break external modules, silently, and horribly (or,
>>> rather, they won't notice, but break the kernel).
>>> Examples of such modules would be DRBD, ZFS, quite possibly others.
>>
>> drbd is upstream -- so what is the problem?  (if you are having to
>> distribute drbd independent of the upstream drbd then why is drbd
>> upstream?)
>>
>> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
>> be doing.
> 
> It's better to make external modules not compile than to silently 
> introduce bugs in them. So yes, I would rename that.
> 
> Mikulas

Agree, better rename current->bio_list to current->bio_lists

Regards,
Jack

^ permalink raw reply

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-03-10 15:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170309232451.4hw2pgizn7potlrj@kernel.org>

On 03/10/2017 12:24 AM, Shaohua Li wrote:
> On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
>> Implement the calculation of partial parity for a stripe and PPL write
>> logging functionality. The description of PPL is added to the
>> documentation. More details can be found in the comments in raid5-ppl.c.
>>
>> Attach a page for holding the partial parity data to stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Partial parity is the xor of not modified data chunks of a stripe and is
>> calculated as follows:
>>
>> - reconstruct-write case:
>>   xor data from all not updated disks in a stripe
>>
>> - read-modify-write case:
>>   xor old data and parity from all updated disks in a stripe
>>
>> Implement it using the async_tx API and integrate into raid_run_ops().
>> It must be called when we still have access to old data, so do it when
>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> stored into sh->ppl_page.
>>
>> Partial parity is not meaningful for full stripe write and is not stored
>> in the log or used for recovery, so don't attempt to calculate it when
>> stripe has STRIPE_FULL_WRITE.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Warn about using PPL with enabled disk volatile write-back cache for
>> now. It can be removed once disk cache flushing before writing PPL is
>> implemented.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> 
> ... snip ...
> 
>> +struct dma_async_tx_descriptor *
>> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>> +		       struct dma_async_tx_descriptor *tx)
>> +{
>> +	int disks = sh->disks;
>> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
>> +	int count = 0, pd_idx = sh->pd_idx, i;
>> +	struct async_submit_ctl submit;
>> +
>> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
>> +
>> +	/*
>> +	 * Partial parity is the XOR of stripe data chunks that are not changed
>> +	 * during the write request. Depending on available data
>> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
>> +	 * differently.
>> +	 */
>> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
>> +		/* rmw: xor old data and parity from updated disks */
>> +		for (i = disks; i--;) {
>> +			struct r5dev *dev = &sh->dev[i];
>> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
>> +				xor_srcs[count++] = dev->page;
>> +		}
>> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
>> +		/* rcw: xor data from all not updated disks */
>> +		for (i = disks; i--;) {
>> +			struct r5dev *dev = &sh->dev[i];
>> +			if (test_bit(R5_UPTODATE, &dev->flags))
>> +				xor_srcs[count++] = dev->page;
>> +		}
>> +	} else {
>> +		return tx;
>> +	}
>> +
>> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
>> +			  flex_array_get(percpu->scribble, 0)
>> +			  + sizeof(struct page *) * (sh->disks + 2));
> 
> Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?

The result of this calculation isn't used later by other async_tx
operations, so it's not needed here, if I understand this correctly. But
maybe later we could optimize and use partial parity to calculate full
parity, then it will be necessary.

Thanks,
Artur

^ permalink raw reply

* RAID 5 3-drive array inactive - no superblock - can anything be saved?
From: Robert Schultz @ 2017-03-10 15:18 UTC (permalink / raw)
  To: linux-raid

Before I mess things up even worse.

I have a PC running BackupPC.

The system contains 4 disks:
boot & system: 1x WD 20GB IDE (9.3 years powered on)
backup data: RAID 5 array md0 containing 3 x Seagate 2TB SATA drives
     ST2000DM001-1ER164   /dev/sdb
     ST2000DM001-1CH164   /dev/sdc
     ST2000DM001-1CH164   /dev/sdd

Back around Christmas I was notified of disk errors on the oldest 2TB 
disk. I purchased a new one, failed out the old one, replaced it and 
everything seemed fine.

At some point I noticed that the system had stopped and I was getting 
XFS errors metadata I/O error: block 0x12910b88 
("xfs_trans_read_buf_map") error 117 ....
<snip>
Corruption of in-memory data detected. Shutting down filesystem.

I did a bit of searching and it appears this may have been a cable 
issue. I shutdown, reseated the SATA cables and restarted. I unmounted 
the filesystem, ran xfs_repair and everything seemed fine.

Then it repeated the issue a week or two later. Unfortunately at this 
time it failed one of the disks.

I shutdown, replaced the SATA cables and restarted. I had some issues on 
restart where it would go into recovery mode. I solved this by removing 
the md0 array from fstab and restarted.

At this point I am unable to restart, or re-assemble the array as mdadm 
can't find the superblocks.

 >cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] 
[raid4] [raid10]
md0 : inactive sdd1[4](S) sdc1[3](S) sdb1[5](S)
       5860147464 blocks super 1.2

unused devices: <none>

 >sudo mdadm --detail /dev/md0
/dev/md0:
         Version : 1.2
      Raid Level : raid0
   Total Devices : 3
     Persistence : Superblock is persistent

           State : inactive

            Name : bkp1:0  (local to host bkp1)
            UUID : 77965a25:38a24b98:9ab5899c:7795ded7
          Events : 396761

     Number   Major   Minor   RaidDevice

        -       8       17        -        /dev/sdb1
        -       8       33        -        /dev/sdc1
        -       8       49        -        /dev/sdd1

Trying to reassemble the array:

 >sudo mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd
mdadm: Cannot assemble mbr metadata on /dev/sdb
mdadm: /dev/sdb has no superblock - assembly aborted

Running the command with any of the disks says the same thing - no 
superblock.





Here is the mdadm examine for the three disks:
sudo mdadm --examine /dev/sdb1
/dev/sdb1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 77965a25:38a24b98:9ab5899c:7795ded7
            Name : bkp1:0  (local to host bkp1)
   Creation Time : Fri May 31 11:06:39 2013
      Raid Level : raid5
    Raid Devices : 3

  Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
      Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262056 sectors, after=1200 sectors
           State : clean
     Device UUID : f8a5b84c:6c63d9bf:1b930a93:55f12cb5

     Update Time : Tue Mar  7 08:01:31 2017
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : dd687715 - correct
          Events : 396761

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 0
    Array State : A.. ('A' == active, '.' == missing, 'R' == replacing)

---------------------------------------------------------------------
sudo mdadm --examine /dev/sdc1
/dev/sdc1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 77965a25:38a24b98:9ab5899c:7795ded7
            Name : bkp1:0  (local to host bkp1)
   Creation Time : Fri May 31 11:06:39 2013
      Raid Level : raid5
    Raid Devices : 3

  Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
      Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=1200 sectors
           State : clean
     Device UUID : 2d4ade03:d6b7e7ce:3744b40b:21a3d17e

     Update Time : Tue Feb 21 23:26:13 2017
        Checksum : e5d43607 - correct
          Events : 396755

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

----------------------------------------------------------------------
  sudo mdadm --examine /dev/sdd1
/dev/sdd1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 77965a25:38a24b98:9ab5899c:7795ded7
            Name : bkp1:0  (local to host bkp1)
   Creation Time : Fri May 31 11:06:39 2013
      Raid Level : raid5
    Raid Devices : 3

  Avail Dev Size : 3906764976 (1862.89 GiB 2000.26 GB)
      Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=1200 sectors
           State : clean
     Device UUID : 97c3bfe9:8ed96f77:ef13ee6b:007874b3

     Update Time : Tue Feb 21 23:26:13 2017
        Checksum : 91a793b - correct
          Events : 396755

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 1
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
------------------------------------------------------------------------
Is the fact that the Array State on sdb only shows 'A..' significant?


fdisk -l shows:

sudo fdisk -l /dev/sd[b-d]
Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x5ebd3967

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect
Disk /dev/sdc: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x5ebd3967

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdc1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect
Disk /dev/sdd: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdd1        2048 3907029167 3907027120  1.8T fd Linux raid autodetect

Any change of putting this thing back together?











^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox