Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Christoph Hellwig @ 2017-03-27 14:57 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Christoph Hellwig, axboe, martin.petersen, agk,
	shli, philipp.reisner, linux-block, linux-scsi, drbd-dev,
	dm-devel, linux-raid
In-Reply-To: <20170327140307.GA13020@redhat.com>

On Mon, Mar 27, 2017 at 10:03:07AM -0400, Mike Snitzer wrote:
> By "you" I assume you're referring to Lars?

Yes.

> Lars' approach for discard,
> when drbd is layered on dm-thinp, feels over-engineered.  Not his fault,
> the way discard and zeroing got conflated certainly lends itself to
> these ugly hacks.  SO I do appreciate that for anything to leverage
> discard_zeroes_data it needs to be reliable.  Which runs counter to how
> discard was implemented (discard may get silently dropped!)  But that is
> why dm-thinp doesn't advertise dzd.  Anyway...

That's exactly what this series does - remove discard_zeroes_data and
use the new REQ_OP_WRITE_ZEROES for anything that wants zeroing offload.

> As for the blkdev_issue_zeroout() resorting to manually zeroing the
> range, if the discard fails or dzd not supported, that certainly
> requires DM thinp to implement manual zeroing of the head and tail of
> the range if partial blocks are being zeroed.  So I welcome any advances
> there.  It is probably something that is best left to Joe or myself to
> tackle.  But I'll gladly accept patches ;)

Ok, I'll happily leave this to the two of you..

^ permalink raw reply

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Mike Snitzer @ 2017-03-27 14:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, axboe, martin.petersen, agk, shli,
	philipp.reisner, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid
In-Reply-To: <20170327091056.GB6879@infradead.org>

On Mon, Mar 27 2017 at  5:10am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> It sounds like you don't want to support traditional discard at all,
> but only WRITE ZEROES.  So in many ways this series is the right way
> forward.  It would be nice if we could do a full blown
> REQ_OP_WRITE_ZEROES for dm_think that zeroes out partial blocks,
> similar to what hardware that implements WRITE SAME of zeroes
> or WRITE ZEROES would do.  I'll see if I could include that in my
> series.

By "you" I assume you're referring to Lars?  Lars' approach for discard,
when drbd is layered on dm-thinp, feels over-engineered.  Not his fault,
the way discard and zeroing got conflated certainly lends itself to
these ugly hacks.  SO I do appreciate that for anything to leverage
discard_zeroes_data it needs to be reliable.  Which runs counter to how
discard was implemented (discard may get silently dropped!)  But that is
why dm-thinp doesn't advertise dzd.  Anyway...

As for the blkdev_issue_zeroout() resorting to manually zeroing the
range, if the discard fails or dzd not supported, that certainly
requires DM thinp to implement manual zeroing of the head and tail of
the range if partial blocks are being zeroed.  So I welcome any advances
there.  It is probably something that is best left to Joe or myself to
tackle.  But I'll gladly accept patches ;)

^ permalink raw reply

* Re: [PATCH v3 2/4] block: add a read barrier in blk_queue_enter()
From: Johannes Thumshirn @ 2017-03-27 12:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke
In-Reply-To: <20170327120658.29864-3-tom.leiming@gmail.com>

On Mon, Mar 27, 2017 at 08:06:56PM +0800, Ming Lei wrote:
> Without the barrier, reading DEAD flag of .q_usage_counter
> and reading .mq_freeze_depth may be reordered, then the
> following wait_event_interruptible() may never return.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: Felix Imendï¿½rffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nï¿½rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply

* Re: [PATCH v3 4/4] block: block new I/O just after queue is set as dying
From: Johannes Thumshirn @ 2017-03-27 12:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke, Tejun Heo
In-Reply-To: <20170327120658.29864-5-tom.leiming@gmail.com>

On Mon, Mar 27, 2017 at 08:06:58PM +0800, Ming Lei wrote:
> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
> from generic bypassing), the dying flag is checked before
> entering queue, and Tejun converts the checking into .mq_freeze_depth,
> and assumes the counter is increased just after dying flag
> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
> 
> This patch calls blk_freeze_queue_start() in blk_set_queue_dying(),
> so that we can block new I/O coming once the queue is set as dying.
> 
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undoing the counter.
> 
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Tejun Heo <tj@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: Felix Imendï¿½rffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nï¿½rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply

* Re: [PATCH v3 3/4] block: rename blk_mq_freeze_queue_start()
From: Johannes Thumshirn @ 2017-03-27 12:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke
In-Reply-To: <20170327120658.29864-4-tom.leiming@gmail.com>

On Mon, Mar 27, 2017 at 08:06:57PM +0800, Ming Lei wrote:
> As the .q_usage_counter is used by both legacy and
> mq path, we need to block new I/O if queue becomes
> dead in blk_queue_enter().
> 
> So rename it and we can use this function in both
> paths.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: Felix Imendï¿½rffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nï¿½rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply

* Re: [PATCH v3 1/4] blk-mq: comment on races related with timeout handler
From: Johannes Thumshirn @ 2017-03-27 12:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke
In-Reply-To: <20170327120658.29864-2-tom.leiming@gmail.com>

On Mon, Mar 27, 2017 at 08:06:55PM +0800, Ming Lei wrote:
> This patch adds comment on two races related with
> timeout handler:
> 
> - requeue from queue busy vs. timeout
> - rq free & reallocation vs. timeout
> 
> Both the races themselves and current solution aren't
> explicit enough, so add comments on them.
> 
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: Felix Imendï¿½rffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nï¿½rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply

* [PATCH v3 1/4] blk-mq: comment on races related with timeout handler
From: Ming Lei @ 2017-03-27 12:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
In-Reply-To: <20170327120658.29864-1-tom.leiming@gmail.com>

This patch adds comment on two races related with
timeout handler:

- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout

Both the races themselves and current solution aren't
explicit enough, so add comments on them.

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c212b9644a9f..b36f0481ba0e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -523,6 +523,15 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
+/*
+ * When we reach here because queue is busy, REQ_ATOM_COMPLETE
+ * flag isn't set yet, so there may be race with timeout hanlder,
+ * but given rq->deadline is just set in .queue_rq() under
+ * this situation, the race won't be possible in reality because
+ * rq->timeout should be set as big enough to cover the window
+ * between blk_mq_start_request() called from .queue_rq() and
+ * clearing REQ_ATOM_STARTED here.
+ */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -696,6 +705,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
+	/*
+	 * The rq being checked may have been freed and reallocated
+	 * out already here, we avoid this race by checking rq->deadline
+	 * and REQ_ATOM_COMPLETE flag together:
+	 *
+	 * - if rq->deadline is observed as new value because of
+	 *   reusing, the rq won't be timed out because of timing.
+	 * - if rq->deadline is observed as previous value,
+	 *   REQ_ATOM_COMPLETE flag won't be cleared in reuse path
+	 *   because we put a barrier between setting rq->deadline
+	 *   and clearing the flag in blk_mq_start_request(), so
+	 *   this rq won't be timed out too.
+	 */
 	if (time_after_eq(jiffies, rq->deadline)) {
 		if (!blk_mark_rq_complete(rq))
 			blk_mq_rq_timed_out(rq, reserved);
-- 
2.9.3

^ permalink raw reply related

* [PATCH v3 4/4] block: block new I/O just after queue is set as dying
From: Ming Lei @ 2017-03-27 12:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Hannes Reinecke, Ming Lei, Tejun Heo
In-Reply-To: <20170327120658.29864-1-tom.leiming@gmail.com>

Before commit 780db2071a(blk-mq: decouble blk-mq freezing
from generic bypassing), the dying flag is checked before
entering queue, and Tejun converts the checking into .mq_freeze_depth,
and assumes the counter is increased just after dying flag
is set. Unfortunately we doesn't do that in blk_set_queue_dying().

This patch calls blk_freeze_queue_start() in blk_set_queue_dying(),
so that we can block new I/O coming once the queue is set as dying.

Given blk_set_queue_dying() is always called in remove path
of block device, and queue will be cleaned up later, we don't
need to worry about undoing the counter.

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-core.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 60f364e1d36b..e22c4ea002ec 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,6 +500,13 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);
 
+	/*
+	 * When queue DYING flag is set, we need to block new req
+	 * entering queue, so we call blk_freeze_queue_start() to
+	 * prevent I/O from crossing blk_queue_enter().
+	 */
+	blk_freeze_queue_start(q);
+
 	if (q->mq_ops)
 		blk_mq_wake_waiters(q);
 	else {
@@ -672,9 +679,9 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		/*
 		 * read pair of barrier in blk_freeze_queue_start(),
 		 * we need to order reading __PERCPU_REF_DEAD flag of
-		 * .q_usage_counter and reading .mq_freeze_depth,
-		 * otherwise the following wait may never return if the
-		 * two reads are reordered.
+		 * .q_usage_counter and reading .mq_freeze_depth or
+		 * queue dying flag, otherwise the following wait may
+		 * never return if the two reads are reordered.
 		 */
 		smp_rmb();
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH v3 2/4] block: add a read barrier in blk_queue_enter()
From: Ming Lei @ 2017-03-27 12:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
In-Reply-To: <20170327120658.29864-1-tom.leiming@gmail.com>

Without the barrier, reading DEAD flag of .q_usage_counter
and reading .mq_freeze_depth may be reordered, then the
following wait_event_interruptible() may never return.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad388d5e309a..5e8963bc98d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -669,6 +669,15 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		if (nowait)
 			return -EBUSY;
 
+		/*
+		 * read pair of barrier in blk_mq_freeze_queue_start(),
+		 * we need to order reading __PERCPU_REF_DEAD flag of
+		 * .q_usage_counter and reading .mq_freeze_depth,
+		 * otherwise the following wait may never return if the
+		 * two reads are reordered.
+		 */
+		smp_rmb();
+
 		ret = wait_event_interruptible(q->mq_freeze_wq,
 				!atomic_read(&q->mq_freeze_depth) ||
 				blk_queue_dying(q));
-- 
2.9.3

^ permalink raw reply related

* [PATCH v3 0/4] block: misc changes
From: Ming Lei @ 2017-03-27 12:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Hannes Reinecke, Ming Lei

Hi,

The 1st  patch add comments on blk-mq races with timeout handler.

The other 3 patches improves handling for dying queue:
	- the 2nd one adds one barrier in blk_queue_enter() for
	avoiding hanging caused by out-of-order
	- the 3rd and 4th patches block new I/O entering queue
	after queue is set as dying

V3:
	- tweak comments as suggested by Bart Van Assche

V2:
	- add one missing barrier in blk_queue_enter()

V1:
	- add comments on races related with timeout handler
	- add Tested-by & Reviewed-by tag

thanks,
Ming

Ming Lei (4):
  blk-mq: comment on races related with timeout handler
  block: add a read barrier in blk_queue_enter()
  block: rename blk_mq_freeze_queue_start()
  block: block new I/O just after queue is set as dying

 block/blk-core.c                  | 16 ++++++++++++++++
 block/blk-mq.c                    | 32 +++++++++++++++++++++++++++-----
 drivers/block/mtip32xx/mtip32xx.c |  2 +-
 drivers/nvme/host/core.c          |  2 +-
 include/linux/blk-mq.h            |  2 +-
 5 files changed, 46 insertions(+), 8 deletions(-)

-- 
2.9.3

^ permalink raw reply

* [PATCH v3 3/4] block: rename blk_mq_freeze_queue_start()
From: Ming Lei @ 2017-03-27 12:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
In-Reply-To: <20170327120658.29864-1-tom.leiming@gmail.com>

As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().

So rename it and we can use this function in both
paths.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-core.c                  |  2 +-
 block/blk-mq.c                    | 10 +++++-----
 drivers/block/mtip32xx/mtip32xx.c |  2 +-
 drivers/nvme/host/core.c          |  2 +-
 include/linux/blk-mq.h            |  2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5e8963bc98d9..60f364e1d36b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -670,7 +670,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 			return -EBUSY;
 
 		/*
-		 * read pair of barrier in blk_mq_freeze_queue_start(),
+		 * read pair of barrier in blk_freeze_queue_start(),
 		 * we need to order reading __PERCPU_REF_DEAD flag of
 		 * .q_usage_counter and reading .mq_freeze_depth,
 		 * otherwise the following wait may never return if the
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b36f0481ba0e..5370b4f750ff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,7 +68,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
 }
 
-void blk_mq_freeze_queue_start(struct request_queue *q)
+void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
@@ -78,7 +78,7 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
 		blk_mq_run_hw_queues(q, false);
 	}
 }
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
@@ -108,7 +108,7 @@ void blk_freeze_queue(struct request_queue *q)
 	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
 	 * exported to drivers as the only user for unfreeze is blk_mq.
 	 */
-	blk_mq_freeze_queue_start(q);
+	blk_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
 
@@ -746,7 +746,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	 * percpu_ref_tryget directly, because we need to be able to
 	 * obtain a reference even in the short window between the queue
 	 * starting to freeze, by dropping the first reference in
-	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * blk_freeze_queue_start, and the moment the last request is
 	 * consumed, marked by the instant q_usage_counter reaches
 	 * zero.
 	 */
@@ -2376,7 +2376,7 @@ static void blk_mq_queue_reinit_work(void)
 	 * take place in parallel.
 	 */
 	list_for_each_entry(q, &all_q_list, all_q_node)
-		blk_mq_freeze_queue_start(q);
+		blk_freeze_queue_start(q);
 	list_for_each_entry(q, &all_q_list, all_q_node)
 		blk_mq_freeze_queue_wait(q);
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index f96ab717534c..c96c35ab39df 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4162,7 +4162,7 @@ static int mtip_block_remove(struct driver_data *dd)
 		dev_info(&dd->pdev->dev, "device %s surprise removal\n",
 						dd->disk->disk_name);
 
-	blk_mq_freeze_queue_start(dd->queue);
+	blk_freeze_queue_start(dd->queue);
 	blk_mq_stop_hw_queues(dd->queue);
 	blk_mq_tagset_busy_iter(&dd->tags, mtip_no_dev_cleanup, dd);
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9b3b57fef446..4a6d7f408769 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2386,7 +2386,7 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_freeze_queue_start(ns->queue);
+		blk_freeze_queue_start(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5b3e201c8d4f..ea2e9dcd3aef 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -243,7 +243,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
-void blk_mq_freeze_queue_start(struct request_queue *q);
+void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()
From: Ming Lei @ 2017-03-27 11:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org, hare@suse.de,
	axboe@fb.com
In-Reply-To: <1490381059.3312.15.camel@sandisk.com>

On Sat, Mar 25, 2017 at 2:45 AM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Sat, 2017-03-25 at 01:38 +0800, Ming Lei wrote:
>> As I explained, the dying flag should only be mentioned after we change
>> the code in blk_set_queue_dying().
>
> Hello Ming,
>
> If patches 2 and 4 would be combined into a single patch then it wouldn't
> be necessary anymore to update the comment introduced in patch 2 in patch 4.
> I think that would make this patch series easier to review.

I think it is better to not do that, because we know patch 2 and patch
4 are doing
two different things, and patch 2 can be thought as one fix, but patch 4 should
be a improvement.  And it is normal to update comment after code changes.

>
> Since the issues fixed by your patches are longstanding issues, have you
> considered to add a "Cc: stable" tag?

I don't know because there isn't real report, and maybe never, so I leave it
alone.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Jinpu Wang @ 2017-03-27 10:21 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
	hch, Fabian Holler, Milind Dumbare, Michael Wang, Roman Penyaev,
	Danil Kipnis
In-Reply-To: <cfde9eab-a52a-8888-7d32-4f6fc9f9640a@grimberg.me>

Hi Sagi,

On Mon, Mar 27, 2017 at 4:20 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> This series introduces IBNBD/IBTRS kernel modules.
>>
>> IBNBD (InfiniBand network block device) allows for an RDMA transfer of
>> block IO
>> over InfiniBand network. The driver presents itself as a block device on
>> client
>> side and transmits the block requests in a zero-copy fashion to the
>> server-side
>> via InfiniBand. The server part of the driver converts the incoming
>> buffers back
>> into BIOs and hands them down to the underlying block device. As soon as
>> IO
>> responses come back from the drive, they are being transmitted back to t=
he
>> client.
>
>
> Hi Jack, Danil and Roman,
>
> I met Danil and Roman last week at Vault, and I think you guys
> are awesome, thanks a lot for open-sourcing your work! However,
> I have a couple of issues here, some are related to the code and
> some are fundamental actually.

Thanks for comments and suggestions, reply in inline.

>
> - Is there room for this ibnbd? If we were to take every block driver
>   that was submitted without sufficient justification, it'd be very
>   hard to maintain. What advantage (if any) does this buy anyone over
>   existing rdma based protocols (srp, iser, nvmf)? I'm really (*really*)
>   not sold on this one...
>
> - To me, the fundamental design that the client side owns a pool of
>   buffers that it issues writes too, seems inferior than the
>   one taken in iser/nvmf (immediate data). IMO, the ibnbd design has
>   scalability issues both in terms of server side resources, client
>   side contention and network congestion (on infiniband the latter is
>   less severe).
>
> - I suggest that for your next post, you provide a real-life use-case
>   where each of the existing drivers can't suffice, and by can't
>   suffice I mean that it has a fundamental issue with it, not something
>   that requires a fix. With that our feedback can be much more concrete
>   and (at least on my behalf) more open to accept it.
>
> - I'm not exactly sure why you would suggest that your implementation
>   supports only infiniband if you use rdma_cm for address resolution,
>   nor I understand why you emphasize feature (2) below, nor why even
>   in the presence of rdma_cm you have ibtrs_ib_path? (confused...)
>   iWARP needs a bit more attention if you don't use the new generic
>   interfaces though...
You reminds me, we also tested in rxe drivers in the past, but not iWARP.
Might work.
ibtrs_ib_path was leftover for APM feature, we used in house, will
remove next round.

>
> - I honestly do not understand why you need *19057* LOC to implement
>   a rdma based block driver. Thats almost larger than all of our
>   existing block drivers combined... First glance at the code provides
>   some explanations, (1) you have some strange code that has no business
>   in a block driver like ibtrs_malloc/ibtrs_zalloc (yikes) or
>   open-coding various existing logging routines, (2) you are for some
>   reason adding a second tag allocation scheme (why?), (3) you are open
>   coding a lot of stuff that we added to the stack in the past months...
>   (4) you seem to over-layer your code for reasons that I do not
>   really understand. And I didn't really look deep at all into the
>   code, just to get the feel of it, and it seems like it needs a lot
>   of work before it can even be considered upstream ready.
Agree, we will clean up the code further, that's why I sent it RFC to
get a early feedback.


>
>> We design and implement this solution based on our need for Cloud
>> Computing,
>> the key features are:
>> - High throughput and low latency due to:
>> 1) Only two rdma messages per IO
>
>
> Where exactly did you witnessed latency that was meaningful by having
> another rdma message on the wire? That's only for writes, anyway, and
> we have first data bursts for that..
Clearly, we need to benchmark on latest kernel.

>
>> 2) Simplified client side server memory management
>> 3) Eliminated SCSI sublayer
>
>
> That's hardly an advantage given all we are losing without it...
>
> ...
>
> Cheers,
> Sagi.

Thanks,
--=20
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Gesch=C3=A4ftsf=C3=BChrer: Achim Weiss

^ permalink raw reply

* Re: [PATCH v3 02/14] md: move two macros into md.h
From: NeilBrown @ 2017-03-27  9:52 UTC (permalink / raw)
  To: Christoph Hellwig, Shaohua Li
  Cc: Ming Lei, Jens Axboe, linux-raid, linux-block, Christoph Hellwig
In-Reply-To: <20170327091553.GF6879@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 900 bytes --]

On Mon, Mar 27 2017, Christoph Hellwig wrote:

> On Fri, Mar 24, 2017 at 09:53:25AM -0700, Shaohua Li wrote:
>> 
>> I had the same concern when I looked at this patch firstly. The number for
>> raid1/10 doesn't need to be the same. But if we don't move the number to a
>> generic header, the third patch will become a little more complicated. I
>> eventually ignored this issue. If we really need different number for raid1/10,
>> lets do it at that time.
>
> Which brings up my usual queastion:  Is is really that benefitical for
> us to keep the raid1.c code around instead of making it a special short
> cut case in raid10.c?

Patches welcome.

They would need to handle write-mostly and write-behind.  They would
also need to avoid the assumption of a chunk size for RAID1.
Undoubtedly do-able.  Hard to say how beneficial it would be, or how much
it would cost.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v3] block: trace completion of all bios.
From: NeilBrown @ 2017-03-27  9:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block, linux-raid,
	dm-devel, Alasdair Kergon, Mike Snitzer, Shaohua Li, linux-kernel,
	Martin K . Petersen
In-Reply-To: <20170327090308.GA11757@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 641 bytes --]

On Mon, Mar 27 2017, Christoph Hellwig wrote:

> I don't really like the flag at all.  I'd much prefer a __bio_endio
> with a 'bool trace' flag.  Also please remove the manual tracing in
> dm.ċ.  Once that is done I suspect we can also remove the
> block_bio_complete export.

Can you say why you don't like it?

I find that it neatly handles all the corner cases that I found, and
keeps the complexity local.

Were we to use a flag to __bio_endio(), we would need one to
__generic_make_request() too because we really don't want 'QUEUE' tracing
when when blk_queue_split() (and similar code) calls it.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v3 02/14] md: move two macros into md.h
From: Christoph Hellwig @ 2017-03-27  9:15 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, Ming Lei, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig
In-Reply-To: <20170324165325.nek4kb4yezz2xmow@kernel.org>

On Fri, Mar 24, 2017 at 09:53:25AM -0700, Shaohua Li wrote:
> 
> I had the same concern when I looked at this patch firstly. The number for
> raid1/10 doesn't need to be the same. But if we don't move the number to a
> generic header, the third patch will become a little more complicated. I
> eventually ignored this issue. If we really need different number for raid1/10,
> lets do it at that time.

Which brings up my usual queastion:  Is is really that benefitical for
us to keep the raid1.c code around instead of making it a special short
cut case in raid10.c?

^ permalink raw reply

* Re: [PATCH v3 01/14] md: raid1/raid10: don't handle failure of bio_add_page()
From: Christoph Hellwig @ 2017-03-27  9:14 UTC (permalink / raw)
  To: Ming Lei; +Cc: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig
In-Reply-To: <20170316161235.27110-2-tom.leiming@gmail.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH] blk-mq: Add NULL pointer check for HW dispatch queue
From: Christoph Hellwig @ 2017-03-27  9:14 UTC (permalink / raw)
  To: Jitendra Bhivare; +Cc: linux-block, Somnath Kotur
In-Reply-To: <1490002801-16280-1-git-send-email-jitendra.bhivare@broadcom.com>

On Mon, Mar 20, 2017 at 03:10:01PM +0530, Jitendra Bhivare wrote:
> As part of blk_mq_realloc_hw_ctx(), if the init_hctx() ops is
> failed by the underyling transport, the hctx pointer is freed and
> initialized to NULL.
> However, functions down the line, access this hwctx pointer without
> a NULL pointer check, which could lead to a kernel crash.

Shouldn't we fail initializing the queue if any of the hctx allocations
fail?

^ permalink raw reply

* Re: [PATCH 06/23] dm-kcopyd: switch to use REQ_OP_WRITE_ZEROES
From: Christoph Hellwig @ 2017-03-27  9:12 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, axboe, martin.petersen, agk, shli,
	philipp.reisner, lars.ellenberg, linux-block, linux-scsi,
	drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170323151037.GC17127@redhat.com>

On Thu, Mar 23, 2017 at 11:10:38AM -0400, Mike Snitzer wrote:
> Not sure why you've split out the dm-kcopyd patch, likely best to just
> fold it into the previous dm support patch.

The dm-kcopyd patch switches to using WRITE_ZEROES instead of write
same for dm as user of these interfaces.  The one before just adds
WRITE_ZEROES support to dm.

^ permalink raw reply

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Christoph Hellwig @ 2017-03-27  9:10 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid
In-Reply-To: <20170323155410.GD1138@soda.linbit>

It sounds like you don't want to support traditional discard at all,
but only WRITE ZEROES.  So in many ways this series is the right way
forward.  It would be nice if we could do a full blown
REQ_OP_WRITE_ZEROES for dm_think that zeroes out partial blocks,
similar to what hardware that implements WRITE SAME of zeroes
or WRITE ZEROES would do.  I'll see if I could include that in my
series.

^ permalink raw reply

* Re: [PATCH] fs: remove _submit_bh()
From: Christoph Hellwig @ 2017-03-27  9:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-fsdevel, linux-block, Alexander Viro, Eric Biggers,
	Darrick J . Wong
In-Reply-To: <20170326040218.28519-1-ebiggers3@gmail.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v3] block: trace completion of all bios.
From: Christoph Hellwig @ 2017-03-27  9:03 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block, linux-raid,
	dm-devel, Alasdair Kergon, Mike Snitzer, Shaohua Li, linux-kernel,
	Martin K . Petersen
In-Reply-To: <87fui3a65o.fsf@notabene.neil.brown.name>

I don't really like the flag at all.  I'd much prefer a __bio_endio
with a 'bool trace' flag.  Also please remove the manual tracing in
dm.ċ.  Once that is done I suspect we can also remove the
block_bio_complete export.

^ permalink raw reply

* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Sagi Grimberg @ 2017-03-27  2:20 UTC (permalink / raw)
  To: Jack Wang, linux-block, linux-rdma
  Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>


> This series introduces IBNBD/IBTRS kernel modules.
>
> IBNBD (InfiniBand network block device) allows for an RDMA transfer of block IO
> over InfiniBand network. The driver presents itself as a block device on client
> side and transmits the block requests in a zero-copy fashion to the server-side
> via InfiniBand. The server part of the driver converts the incoming buffers back
> into BIOs and hands them down to the underlying block device. As soon as IO
> responses come back from the drive, they are being transmitted back to the
> client.

Hi Jack, Danil and Roman,

I met Danil and Roman last week at Vault, and I think you guys
are awesome, thanks a lot for open-sourcing your work! However,
I have a couple of issues here, some are related to the code and
some are fundamental actually.

- Is there room for this ibnbd? If we were to take every block driver
   that was submitted without sufficient justification, it'd be very
   hard to maintain. What advantage (if any) does this buy anyone over
   existing rdma based protocols (srp, iser, nvmf)? I'm really (*really*)
   not sold on this one...

- To me, the fundamental design that the client side owns a pool of
   buffers that it issues writes too, seems inferior than the
   one taken in iser/nvmf (immediate data). IMO, the ibnbd design has
   scalability issues both in terms of server side resources, client
   side contention and network congestion (on infiniband the latter is
   less severe).

- I suggest that for your next post, you provide a real-life use-case
   where each of the existing drivers can't suffice, and by can't
   suffice I mean that it has a fundamental issue with it, not something
   that requires a fix. With that our feedback can be much more concrete
   and (at least on my behalf) more open to accept it.

- I'm not exactly sure why you would suggest that your implementation
   supports only infiniband if you use rdma_cm for address resolution,
   nor I understand why you emphasize feature (2) below, nor why even
   in the presence of rdma_cm you have ibtrs_ib_path? (confused...)
   iWARP needs a bit more attention if you don't use the new generic
   interfaces though...

- I honestly do not understand why you need *19057* LOC to implement
   a rdma based block driver. Thats almost larger than all of our
   existing block drivers combined... First glance at the code provides
   some explanations, (1) you have some strange code that has no business
   in a block driver like ibtrs_malloc/ibtrs_zalloc (yikes) or
   open-coding various existing logging routines, (2) you are for some
   reason adding a second tag allocation scheme (why?), (3) you are open
   coding a lot of stuff that we added to the stack in the past months...
   (4) you seem to over-layer your code for reasons that I do not
   really understand. And I didn't really look deep at all into the
   code, just to get the feel of it, and it seems like it needs a lot
   of work before it can even be considered upstream ready.

> We design and implement this solution based on our need for Cloud Computing,
> the key features are:
> - High throughput and low latency due to:
> 1) Only two rdma messages per IO

Where exactly did you witnessed latency that was meaningful by having
another rdma message on the wire? That's only for writes, anyway, and
we have first data bursts for that..

> 2) Simplified client side server memory management
> 3) Eliminated SCSI sublayer

That's hardly an advantage given all we are losing without it...

...

Cheers,
Sagi.

^ permalink raw reply

* Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
From: Paul E. McKenney @ 2017-03-26 23:28 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: linux-kernel, linux-block, pprakash, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
	Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
	Boris Ostrovsky, Richard Weinberger
In-Reply-To: <db9c91f6-1b17-6136-84f0-03c3c2581ab4@codeaurora.org>

On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> Hello,
> 
> I observe that running stress-ng with the cpu-online and fstat tests
> results in a deadlock of hung tasks:
> 
> [  366.810486] INFO: task stress-ng-cpu-o:2590 blocked for more than
> 120 seconds.
> [  366.817689]       Not tainted 4.9.0 #39
> [  366.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  366.829320] stress-ng-cpu-o D    0  2590   2589 0x00000008
> [  366.834803] Call trace:
> [  366.837222] [<ffff000008085dd0>] __switch_to+0x60/0x70
> [  366.842338] [<ffff000008a23c18>] __schedule+0x178/0x648
> [  366.847550] [<ffff000008a24120>] schedule+0x38/0x98
> [  366.852408] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
> [  366.858749] [<ffff00000848e9d4>] blk_mq_queue_reinit_work+0x74/0x110
> [  366.865081] [<ffff00000848ea94>] blk_mq_queue_reinit_dead+0x24/0x30
> [  366.871335] [<ffff0000080c9898>] cpuhp_invoke_callback+0x98/0x4a8
> [  366.877411] [<ffff0000080cb084>] cpuhp_down_callbacks+0x114/0x150
> [  366.883484] [<ffff000008a22578>] _cpu_down+0x100/0x1d8
> [  366.888609] [<ffff0000080cbfdc>] do_cpu_down+0x4c/0x78
> [  366.893727] [<ffff0000080cc02c>] cpu_down+0x24/0x30
> [  366.898593] [<ffff0000086aaf28>] cpu_subsys_offline+0x20/0x30
> [  366.904318] [<ffff0000086a53d8>] device_offline+0xa8/0xd8
> [  366.909704] [<ffff0000086a550c>] online_store+0x4c/0xa8
> [  366.914907] [<ffff0000086a241c>] dev_attr_store+0x44/0x60
> [  366.920294] [<ffff0000082b6a24>] sysfs_kf_write+0x5c/0x78
> [  366.925672] [<ffff0000082b5cec>] kernfs_fop_write+0xbc/0x1e8
> [  366.931318] [<ffff000008238320>] __vfs_write+0x48/0x138
> [  366.936526] [<ffff000008239078>] vfs_write+0xa8/0x1c0
> [  366.941557] [<ffff00000823a08c>] SyS_write+0x54/0xb0
> [  366.946511] [<ffff000008083370>] el0_svc_naked+0x24/0x28
> [  366.951800] INFO: task stress-ng-fstat:2591 blocked for more than
> 120 seconds.
> [  366.959008]       Not tainted 4.9.0 #39
> [  366.962823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  366.970640] stress-ng-fstat D    0  2591   2589 0x00000000
> [  366.976105] Call trace:
> [  366.978540] [<ffff000008085dd0>] __switch_to+0x60/0x70
> [  366.983658] [<ffff000008a23c18>] __schedule+0x178/0x648
> [  366.988870] [<ffff000008a24120>] schedule+0x38/0x98
> [  366.993727] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8
> [  367.000068] [<ffff00000848e2d0>] blk_mq_freeze_queue+0x28/0x38
> [  367.005880] [<ffff0000086d480c>] lo_release+0x64/0x90
> [  367.010919] [<ffff000008278bd4>] __blkdev_put+0x26c/0x2c8
> [  367.016300] [<ffff000008278fec>] blkdev_put+0x54/0x128
> [  367.021418] [<ffff0000082790ec>] blkdev_close+0x2c/0x40
> [  367.026631] [<ffff00000823ab58>] __fput+0xa0/0x1e0
> [  367.031401] [<ffff00000823ad10>] ____fput+0x20/0x30
> [  367.036266] [<ffff0000080e7a40>] task_work_run+0xc8/0xe8
> [  367.041557] [<ffff0000080882b4>] do_notify_resume+0xac/0xb8
> [  367.047116] [<ffff000008083294>] work_pending+0x8/0x10
> 
> I have tested and found this issue to be reproducible on both x86
> and ARM64 architectures on 4.7, 4.8, 4.9, 4.10, and 4.11-rc3
> kernels.
> 
> Using the below test methodology [1], the issue reproduces within a
> few minutes.
> 
> Using ftrace, I have analyzed the issue on 4.9 and I believe I've
> found the root cause [2].
> 
> Based on my analysis, I have developed a fix [3], which addresses
> the issue as I am able to run stress-ng for over an hour where I was
> unable to do so before, however I do not know the full extend of
> impacts from this fix, and look for guidance from the community to
> determine the final fix.
> 
> 
> [1] Test methodology
> --------------------
> Boot a multicore system such as a desktop i5 system with nr_cpus=2
> 
> Enable all logging to determine when the deadlock occurs (prints
> from test stop flowing out of the serial port)
>      echo 1 > /sys/module/printk/parameters/ignore_loglevel
> 
> Run stress-ng
>      stress-ng --fstat 1 --cpu-online 1 -t 3600
> 
> Wait for the test output to stop, and the hung task watchdog to fire.
> 
> 
> [2] Analysis
> ------------
> Again, this analysis is based upon the 4.9 kernel, but believe it to
> still apply to newer kernels.
> 
> I conclude that the hung tasks occur due to a race condition which
> results in a deadlock.
> 
> The race condition occurs between "normal" work in the block layer
> on a core (the stress-ng-fstat task in the above dump) and cpu
> offline of that core (the stress-ng-cpu-o task in the above dump).
> 
> The fput() from userspace in the fstat task results in a call to
> blk_mq_freeze_queue(), which drops the last reference to the queue
> via percpu_ref_kill(), and then waits for the ref count of the queue
> to hit 0 in blk_mq_freeze_queue_wait().  percpu_ref_kill() will
> result in __percpu_ref_switch_to_atomic() which will use
> call_rcu_sched() to setup delayed work to finalize the percpu_ref
> cleanup and drop the ref count to 0.
> 
> Note that call_rcu_sched() queues the work to a per-cpu queue, thus
> the work can only be run on the core it is queued on, by the work
> thread that is pinned to that core.
> 
> It is a race between this work running, and the cpu offline processing.

One quick way to test this assumption is to build a kernel with Kconfig
options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y.  This will
cause call_rcu_sched() to queue the work to a kthread, which can migrate
to some other CPU.  If your analysis is correct, this should avoid
the deadlock.  (Note that the deadlock should be fixed in any case,
just a diagnostic assumption-check procedure.)

> If the cpu offline processing is able to get to and process the
> RCU/tree:online state before the queued work from the block layer,
> then the pinned work thread will be migrated to another core via
> rcutree_offline_cpu(), and the work will not be able to execute.
> 
> This race condition does not result in deadlock until later in the
> cpu offline processing.  Once we hit the block/mq:prepare state the
> block layer freezes all the queues and waits for the ref counts to
> hit 0. This normally works because at this point the cpu being
> offlined is dead from cpu:teardown, and the offline processing is
> occuring on another active cpu, so call_rcu_sched() will queue work
> to an active cpu where it can get processed.  However the fstat
> process already did that work for one of the queues to be frozen in
> the block layer, so the processing of the block/mq:prepare state
> waits on the same ref count as fstat to hit 0.  Thus we see the
> result of this as the stress-ng-cpu-o task above.
> 
> The block/mq:prepare processing stalls the cpu offline processing
> which causes a deadlock because the processing does not get to the
> RCU/tree:prepare state which migrates all of the queued work from
> the offline cpu to another cpu, which would allow the work that the
> fstat task queued to execute, drop the ref count to 0, and unblock
> both stalled tasks.
> 
> By reordering the cpu offline states such the shutdown processing of
> RCU/tree:prepare occurs before block/mq:prepare [3], we prevent
> deadlock by enabling the queued work in the RCU framework to run
> elsewhere, and eventually unblock the tasks waiting on the ref
> count.
> 
> However, it is not entirely clear what are the full ramifications of
> this reorder.  I assume the ordering of these cpu online/offline
> states is carefully considered, and without that knowledge, I could
> not say for certain that my fix [3] is safe.
> 
> What is the opinion of the domain experts?

I do hope that we can come up with a better fix.  No offense intended,
as coming up with -any- fix in the CPU-hotplug domain is not to be
denigrated, but this looks to be at vest quite fragile.

							Thanx, Paul

> -- 
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
> 
> 
> [3] Proposed fix
> ---8>---
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index afe641c..9b86db9 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -49,6 +49,7 @@ enum cpuhp_state {
>         CPUHP_ARM_SHMOBILE_SCU_PREPARE,
>         CPUHP_SH_SH3X_PREPARE,
>         CPUHP_BLK_MQ_PREPARE,
> +       CPUHP_RCUTREE_PREP2,
>         CPUHP_TIMERS_DEAD,
>         CPUHP_NOTF_ERR_INJ_PREPARE,
>         CPUHP_MIPS_SOC_PREPARE,
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 29de1a9..b46c573 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void)
>         [CPUHP_RCUTREE_PREP] = {
>                 .name                   = "RCU/tree:prepare",
>                 .startup.single         = rcutree_prepare_cpu,
> +               .teardown.single        = NULL,
> +       },
> +       [CPUHP_RCUTREE_PREP2] = {
> +               .name                   = "RCU/tree:dead",
> +               .startup.single         = NULL,
>                 .teardown.single        = rcutree_dead_cpu,
>         },
>         /*
> 

^ permalink raw reply

* Re: [PATCH v3] block: trace completion of all bios.
From: NeilBrown @ 2017-03-26 23:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	open list:DEVICE-MAPPER (LVM), Alasdair Kergon, Mike Snitzer,
	Shaohua Li, Linux Kernel Mailing List, Martin K . Petersen
In-Reply-To: <CACVXFVNC=-_f1jYo5bYw7X6TT8-g7UThw3zzzfrMaKVarnJvSQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4825 bytes --]

On Fri, Mar 24 2017, Ming Lei wrote:

> On Fri, Mar 24, 2017 at 8:07 AM, NeilBrown <neilb@suse.com> wrote:
...
>> @@ -102,6 +102,8 @@ struct bio {
>>  #define BIO_REFFED     8       /* bio has elevated ->bi_cnt */
>>  #define BIO_THROTTLED  9       /* This bio has already been subjected to
>>                                  * throttling rules. Don't do it again. */
>> +#define BIO_TRACE_COMPLETION 10        /* bio_endio() should trace the final completion
>> +                                * of this bio. */
>
> This may not be a good idea, since the flag space is quite small(12).

which means there are 2 unused bits and I only want to use 1.  It should
fit.
I'm a bit perplexed why 4 bits a reserved for the BVEC_POOL number which
ranges from 0 to 7....

But if these bits really are a scarce resource, we should stop wasting
them.
The patch below makes bit 7 (BIO_CHAIN) available.  We could probably make
bit 8 (BIO_REFFED) available using a similar technique.
BIO_QUIET is only relevant if bi_error is non-zero and bi_error has a
limited range, so a bit in there could be used instead. In fact,
MAX_ERRNO is 4096, so bi_error could be a 'short'.  That could save 2
whole bytes ... or make 16 more flag bits available.

So I don't think you concern about a shortage of spare flag bits is valid.

Thanks,
NeilBrown

diff --git a/block/bio.c b/block/bio.c
index c1272986133e..1703786a206a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -274,7 +274,7 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 	      unsigned short max_vecs)
 {
 	memset(bio, 0, sizeof(*bio));
-	atomic_set(&bio->__bi_remaining, 1);
+	atomic_set(&bio->__bi_remaining, 0);
 	atomic_set(&bio->__bi_cnt, 1);
 
 	bio->bi_io_vec = table;
@@ -300,7 +300,7 @@ void bio_reset(struct bio *bio)
 
 	memset(bio, 0, BIO_RESET_BYTES);
 	bio->bi_flags = flags;
-	atomic_set(&bio->__bi_remaining, 1);
+	atomic_set(&bio->__bi_remaining, 0);
 }
 EXPORT_SYMBOL(bio_reset);
 
@@ -1794,20 +1794,15 @@ EXPORT_SYMBOL(bio_flush_dcache_pages);
 static inline bool bio_remaining_done(struct bio *bio)
 {
 	/*
-	 * If we're not chaining, then ->__bi_remaining is always 1 and
+	 * If we're not chaining, then ->__bi_remaining is always 0 and
 	 * we always end io on the first invocation.
 	 */
-	if (!bio_flagged(bio, BIO_CHAIN))
+	if (atomic_read(&bio->__bi_remaining) == 0)
 		return true;
 
 	BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
 
-	if (atomic_dec_and_test(&bio->__bi_remaining)) {
-		bio_clear_flag(bio, BIO_CHAIN);
-		return true;
-	}
-
-	return false;
+	return atomic_dec_and_test(&bio->__bi_remaining);
 }
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..d15245544111 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -664,13 +664,19 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 }
 
 /*
- * Increment chain count for the bio. Make sure the CHAIN flag update
- * is visible before the raised count.
+ * Increment chain count for the bio.
  */
 static inline void bio_inc_remaining(struct bio *bio)
 {
-	bio_set_flag(bio, BIO_CHAIN);
-	smp_mb__before_atomic();
+	/*
+	 * Calls to bio_inc_remaining() cannot race
+	 * with the final call to bio_end_io(), and
+	 * the first call cannot race with other calls,
+	 * so if __bi_remaining appears to be zero, there
+	 * can be no race which might change it.
+	 */
+	if (atomic_read(&bio->__bi_remaining) == 0)
+		atomic_set(&bio->__bi_remaining, 1);
 	atomic_inc(&bio->__bi_remaining);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db7a57ee0e58..2deea4501d14 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -46,6 +46,15 @@ struct bio {
 	unsigned int		bi_seg_front_size;
 	unsigned int		bi_seg_back_size;
 
+	/* __bi_remaining records the number of completion events
+	 * (i.e. calls to bi_end_io()) that need to happen before this
+	 * bio is truly complete.
+	 * A value of '0' means that there will only ever be one completion
+	 * event, so there will be no racing and no need for an atomic operation
+	 * to detect the last event.
+	 * Any other value represents a simple count subject to atomic_inc() and
+	 * atomic_dec_and_test().
+	 */
 	atomic_t		__bi_remaining;
 
 	bio_end_io_t		*bi_end_io;
@@ -98,7 +107,7 @@ struct bio {
 #define BIO_USER_MAPPED 4	/* contains user pages */
 #define BIO_NULL_MAPPED 5	/* contains invalid user pages */
 #define BIO_QUIET	6	/* Make BIO Quiet */
-#define BIO_CHAIN	7	/* chained bio, ->bi_remaining in effect */
+/* 7 unused */
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox