Linux block layer
* [PATCH] blk-mq: Rework blk-mq timeout handling again
@ 2018-05-10 17:56 Bart Van Assche
  2018-05-11 12:06 ` jianchao.wang
  0 siblings, 1 reply; 10+ messages in thread
From: Bart Van Assche @ 2018-05-10 17:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Ming Lei, Sebastian Ott, Sagi Grimberg, Israel Rukshin,
	Max Gurtovoy

Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).

This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. Without this
fix, a request never terminates if its block driver timeout handler
always returns BLK_EH_RESET_TIMER.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation
  (sh and mips).
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sebastian Ott <sebott@linux.ibm.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to prevent timeout handling from
  starting while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 178 +++++++++++++------------------------------------
 block/blk-mq.h         | 118 ++++++++++++++++++++++++--------
 block/blk-timeout.c    |  89 ++++++++++++++-----------
 block/blk.h            |  21 +++---
 include/linux/blkdev.h |  44 ++++++------
 7 files changed, 221 insertions(+), 236 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1418a1ccd80d..223f650712f4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e9d83594cca..53c750f8db22 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	rq->__deadline = 0;
+#ifndef CONFIG_ARCH_HAS_CMPXCHG64
+	spin_lock_init(&rq->das_lock);
+#endif
+	rq->das.das = 0;
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -492,7 +495,8 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -545,8 +549,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -591,36 +594,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -632,27 +605,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -679,27 +637,7 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, rq);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -712,22 +650,19 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, rq);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -836,8 +771,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -846,13 +779,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -866,48 +793,44 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
-
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
+	union blk_deadline_and_state das = READ_ONCE(rq->das);
+	unsigned long now = jiffies;
+	int32_t diff_jiffies = das.deadline - now;
+	int32_t diff_next = das.deadline - data->next;
+	enum mq_rq_state rq_state = das.state;
+
+	rq->aborted_gstate = das;
+	rq->aborted_gstate.generation ^= (1ULL << 29);
+	if (diff_jiffies <= 0 && rq_state == MQ_RQ_IN_FLIGHT) {
+		/* request timed out */
+		rq->aborted_gstate = das;
 		data->nr_expired++;
 		hctx->nr_expired++;
-	} else if (!data->next_set || time_after(data->next, deadline)) {
-		data->next = deadline;
+	} else if (!data->next_set) {
+		/* data->next is not yet set; set it to deadline. */
+		data->next = now + diff_jiffies;
 		data->next_set = 1;
+	} else if (diff_next < 0) {
+		/* data->next is later than deadline; reduce data->next. */
+		data->next += diff_next;
 	}
+
 }
 
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	union blk_deadline_and_state old_val = rq->aborted_gstate;
+	union blk_deadline_and_state new_val = old_val;
+
+	new_val.state = MQ_RQ_COMPLETE;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->das has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -946,10 +869,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		bool has_rcu = false;
 
 		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished.
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
 			if (!hctx->nr_expired)
@@ -965,7 +888,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		if (has_rcu)
 			synchronize_rcu();
 
-		/* terminate the ones we won */
+		/* Terminate the requests marked by blk_mq_check_expired(). */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
 	}
 
@@ -2061,15 +1984,6 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
-
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..3f29fbe7ee62 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,18 +30,11 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* See also blk_deadline_and_state.state. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -103,37 +96,108 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+#if defined(CONFIG_SUPERH) || defined(CONFIG_MIPS)
+#undef CONFIG_ARCH_HAS_CMPXCHG64
+#else
+#define CONFIG_ARCH_HAS_CMPXCHG64
+#endif
+
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_deadline_and_state old_val,
+				       union blk_deadline_and_state new_val)
+{
+#ifdef CONFIG_ARCH_HAS_CMPXCHG64
+	return cmpxchg64(&rq->das.das, old_val.das, new_val.das) == old_val.das;
+#else
+	unsigned long flags;
+	bool res = false;
+
+	spin_lock_irqsave(&rq->das_lock, flags);
+	if (rq->das.das == old_val.das) {
+		rq->das = new_val;
+		res = true;
+	}
+	spin_unlock_irqrestore(&rq->das_lock, flags);
+
+	return res;
+#endif
+}
+
+/*
+ * If the state of request @rq equals @old_state, update deadline and request
+ * state atomically to @time and @new_state. blk-mq only. cmpxchg64() is only
+ * used if there could be a concurrent update attempt from another context.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+					  unsigned long new_time,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
+{
+	union blk_deadline_and_state old_val, new_val;
+
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		old_val = READ_ONCE(rq->das);
+		if (old_val.state != old_state)
+			return false;
+		new_val = old_val;
+		new_val.deadline = new_time;
+		new_val.state = new_state;
+		WRITE_ONCE(rq->das.das, new_val.das);
+		return true;
+	}
+
+	do {
+		old_val = READ_ONCE(rq->das);
+		if (old_val.state != old_state)
+			return false;
+		new_val = old_val;
+		new_val.deadline = new_time;
+		new_val.state = new_state;
+	} while (!blk_mq_set_rq_state(rq, old_val, new_val));
+
+	return true;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return rq->das.state;
 }
 
 /**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
- * @rq: target request.
- * @state: new state to set.
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
  *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
  */
-static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
+	union blk_deadline_and_state das = READ_ONCE(rq->das);
+	union blk_deadline_and_state old_val = das;
+	union blk_deadline_and_state new_val = das;
+
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight to another state cmpxchg64()
+	 * must be used. For other state transitions it is safe to use
+	 * WRITE_ONCE().
+	 */
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		WRITE_ONCE(rq->das.das, new_val.das);
+		return true;
 	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return blk_mq_set_rq_state(rq, old_val, new_val);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..efe237a421b3 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -162,8 +162,9 @@ void blk_abort_request(struct request *req)
 		 * immediately and that scan sees the new timeout value.
 		 * No need for fancy synchronizations.
 		 */
-		blk_rq_set_deadline(req, jiffies);
-		kblockd_schedule_work(&req->q->timeout_work);
+		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_IN_FLIGHT))
+			kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -184,52 +185,17 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
 	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;
@@ -244,5 +210,52 @@ void blk_add_timer(struct request *req)
 		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
 			mod_timer(&q->timeout, expiry);
 	}
+}
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+
+	lockdep_assert_held(q->queue_lock);
 
+	if (!q->rq_timed_out_fn)
+		return;
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+
+	blk_rq_set_deadline(req, jiffies + req->timeout);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+	return __blk_add_timer(req);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ * @old:	current request state.
+ * @new:	new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new)
+{
+	struct request_queue *q = req->q;
+
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+	if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
+		WARN_ON_ONCE(true);
+	return __blk_add_timer(req);
 }
diff --git a/block/blk.h b/block/blk.h
index eaf1a8e87d11..e379de3d4b3b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new);
 void blk_delete_timer(struct request *);
 
 
@@ -191,21 +193,21 @@ void blk_account_io_done(struct request *req, u64 now);
 /*
  * EH timer and IO completion will both attempt to 'grab' the request, make
  * sure that only one of them succeeds. Steal the bottom bit of the
- * __deadline field for this.
+ * das field for this.
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(0, &rq->__deadline);
+	return test_and_set_bit(0, &rq->das.legacy_deadline);
 }
 
 static inline void blk_clear_rq_complete(struct request *rq)
 {
-	clear_bit(0, &rq->__deadline);
+	clear_bit(0, &rq->das.legacy_deadline);
 }
 
 static inline bool blk_rq_is_complete(struct request *rq)
 {
-	return test_bit(0, &rq->__deadline);
+	return test_bit(0, &rq->das.legacy_deadline);
 }
 
 /*
@@ -308,18 +310,19 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 }
 
 /*
- * Steal a bit from this field for legacy IO path atomic IO marking. Note that
- * setting the deadline clears the bottom bit, potentially clearing the
- * completed bit. The user has to be OK with this (current ones are fine).
+ * Steal two bits from this field. The legacy IO path uses the lowest bit for
+ * atomic IO marking. Note that setting the deadline clears the bottom bit,
+ * potentially clearing the completed bit. The current legacy block layer is
+ * fine with that. Must be called with the request queue lock held.
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->das.legacy_deadline = time & ~0x1UL;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->das.legacy_deadline & ~0x1UL;
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e42d510daf3c..4915b4371e52 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,8 +27,6 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -125,15 +123,23 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
+union blk_deadline_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+		uint32_t deadline;
+	};
+	unsigned long legacy_deadline;
+	uint64_t das;
+};
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -236,29 +242,21 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
 	/*
 	 * ->aborted_gstate is used by the timeout to claim a specific
 	 * recycle instance of this request.  See blk_mq_timeout_work().
 	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
+	union blk_deadline_and_state aborted_gstate;
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
-	unsigned long __deadline;
+	/* Only used if CONFIG_ARCH_HAS_CMPXCHG64 is not defined. */
+	spinlock_t das_lock;
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state() for
+	 * blk-mq queues.
+	 */
+	union blk_deadline_and_state das;
 
 	struct list_head timeout_list;
 
-- 
2.16.3


* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-10 17:56 [PATCH] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-05-11 12:06 ` jianchao.wang
  2018-05-11 12:35   ` Christoph Hellwig
  2018-05-11 15:26   ` Bart Van Assche
  0 siblings, 2 replies; 10+ messages in thread
From: jianchao.wang @ 2018-05-11 12:06 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Tejun Heo, Ming Lei,
	Sebastian Ott, Sagi Grimberg, Israel Rukshin, Max Gurtovoy

Hi bart

I added debug logging in blk_mq_add_timer as follows:

void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
                     enum mq_rq_state new)
{
       struct request_queue *q = req->q;

       if (!req->timeout)
               req->timeout = q->rq_timeout;
       if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
               WARN_ON_ONCE(true);

       trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
                       jiffies,
                       req->timeout,
                       blk_rq_deadline(req),
                       req->das.generation,
                       req->das.deadline);
 
       return __blk_add_timer(req);
 }

And got the log below:

     jbd2/sda2-8-320   [000] ...1    95.030824: blk_mq_add_timer: jiffies ffff37c0 to 1d4c ldl ffff550c40000000 gen 0 dl ffff550c
    kworker/0:1H-136   [000] ...1    95.031822: blk_mq_add_timer: jiffies ffff37c0 to 1d4c ldl ffff550c40000000 gen 0 dl ffff550c
    kworker/6:1H-244   [006] ...1    95.041695: blk_mq_add_timer: jiffies ffff37c3 to 1d4c ldl ffff550f40000000 gen 0 dl ffff550f
    kworker/6:1H-244   [006] ...1    95.041954: blk_mq_add_timer: jiffies ffff37c3 to 1d4c ldl ffff550f40000000 gen 0 dl ffff550f

The blk_rq_deadline() call returns ffff550c40000000, which looks really crazy.
It should be due to union blk_deadline_and_state. 
+union blk_deadline_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+		uint32_t deadline;
+	};
+	unsigned long legacy_deadline;
+	uint64_t das;
+};

And generation never changes. 

Thanks
Jianchao


* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-11 12:06 ` jianchao.wang
@ 2018-05-11 12:35   ` Christoph Hellwig
  2018-05-11 15:29     ` Bart Van Assche
  2018-05-11 15:26   ` Bart Van Assche
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2018-05-11 12:35 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Tejun Heo, Ming Lei, Sebastian Ott, Sagi Grimberg, Israel Rukshin,
	Max Gurtovoy

> It should be due to union blk_deadline_and_state. 
> +union blk_deadline_and_state {
> +	struct {
> +		uint32_t generation:30;
> +		uint32_t state:2;
> +		uint32_t deadline;
> +	};
> +	unsigned long legacy_deadline;
> +	uint64_t das;
> +};

Yikes.  Or we just move it into a separate field.  This patch already
shrinks struct request a lot, so I'd rather do that to keep it simple.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-11 12:06 ` jianchao.wang
  2018-05-11 12:35   ` Christoph Hellwig
@ 2018-05-11 15:26   ` Bart Van Assche
  2018-05-14  1:48     ` jianchao.wang
  1 sibling, 1 reply; 10+ messages in thread
From: Bart Van Assche @ 2018-05-11 15:26 UTC (permalink / raw)
  To: jianchao.w.wang@oracle.com, axboe@kernel.dk
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, hch@lst.de, sebott@linux.ibm.com,
	ming.lei@redhat.com, maxg@mellanox.com, tj@kernel.org

On Fri, 2018-05-11 at 20:06 +0800, jianchao.wang wrote:
> Hi Bart
> 
> I add debug log in blk_mq_add_timer as following
> 
> void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
>                      enum mq_rq_state new)
> {
>        struct request_queue *q = req->q;
> 
>        if (!req->timeout)
>                req->timeout = q->rq_timeout;
>        if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
>                WARN_ON_ONCE(true);
> 
>        trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
>                        jiffies,
>                        req->timeout,
>                        blk_rq_deadline(req),
>                        req->das.generation,
>                        req->das.deadline);
>  
>        return __blk_add_timer(req);
>  }
> 
> And get log below:
> 
>      jbd2/sda2-8-320   [000] ...1    95.030824: blk_mq_add_timer: jiffies ffff37c0 to 1d4c ldl ffff550c40000000 gen 0 dl ffff550c
>     kworker/0:1H-136   [000] ...1    95.031822: blk_mq_add_timer: jiffies ffff37c0 to 1d4c ldl ffff550c40000000 gen 0 dl ffff550c
>     kworker/6:1H-244   [006] ...1    95.041695: blk_mq_add_timer: jiffies ffff37c3 to 1d4c ldl ffff550f40000000 gen 0 dl ffff550f
>     kworker/6:1H-244   [006] ...1    95.041954: blk_mq_add_timer: jiffies ffff37c3 to 1d4c ldl ffff550f40000000 gen 0 dl ffff550f
> 
> blk_rq_deadline() returns ffff550c40000000, which looks really crazy.

The bug is in the above trace_printk() call: blk_rq_deadline() must only be used
for the legacy block layer and not for blk-mq code. If you have a look at the value
of the das.deadline field then one can see that the value of that field is correct:
0xffff550c - 0xffff37c0 = 7500 = 30 * 250. Does that mean that HZ = 250 on the system
on which you ran this test?

> And the generation never changes.

That's a good catch. The code for incrementing the generation number occurs in
blk_mq_change_rq_state() but is missing from blk_mq_rq_set_deadline(). I will fix this.

Bart.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-11 12:35   ` Christoph Hellwig
@ 2018-05-11 15:29     ` Bart Van Assche
  2018-05-14  1:37       ` jianchao.wang
  0 siblings, 1 reply; 10+ messages in thread
From: Bart Van Assche @ 2018-05-11 15:29 UTC (permalink / raw)
  To: hch@lst.de, jianchao.w.wang@oracle.com
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, sebott@linux.ibm.com, axboe@kernel.dk,
	ming.lei@redhat.com, maxg@mellanox.com, tj@kernel.org

On Fri, 2018-05-11 at 14:35 +0200, Christoph Hellwig wrote:
> > It should be due to union blk_deadline_and_state. 
> > +union blk_deadline_and_state {
> > +	struct {
> > +		uint32_t generation:30;
> > +		uint32_t state:2;
> > +		uint32_t deadline;
> > +	};
> > +	unsigned long legacy_deadline;
> > +	uint64_t das;
> > +};
> 
> Yikes.  Or we just move it into a separate field.  This patch already
> shrinks struct request a lot, so I'd rather do that to keep it simple.

Hello Christoph,

Are you perhaps referring to the legacy_deadline field? Have you noticed that
Jianchao used a legacy block layer function in blk-mq code and that that is
why a wrong value for the deadline appeared in the trace output?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-11 15:29     ` Bart Van Assche
@ 2018-05-14  1:37       ` jianchao.wang
  2018-05-14  4:03         ` Bart Van Assche
  0 siblings, 1 reply; 10+ messages in thread
From: jianchao.wang @ 2018-05-14  1:37 UTC (permalink / raw)
  To: Bart Van Assche, hch@lst.de
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, sebott@linux.ibm.com, axboe@kernel.dk,
	ming.lei@redhat.com, maxg@mellanox.com, tj@kernel.org


Hi Bart

On 05/11/2018 11:29 PM, Bart Van Assche wrote:
> On Fri, 2018-05-11 at 14:35 +0200, Christoph Hellwig wrote:
>>> It should be due to union blk_deadline_and_state. 
>>> +union blk_deadline_and_state {
>>> +	struct {
>>> +		uint32_t generation:30;
>>> +		uint32_t state:2;
>>> +		uint32_t deadline;
>>> +	};
>>> +	unsigned long legacy_deadline;
>>> +	uint64_t das;
>>> +};
>>
>> Yikes.  Or we just move it into a separate field.  This patch already
>> shrinks struct request a lot, so I'd rather do that to keep it simple.
> 
> Hello Christoph,
> 
> Are you perhaps referring to the legacy_deadline field? Have you noticed that
> Jianchao used a legacy block layer function in blk-mq code and that that is
> why a wrong value for the deadline appeared in the trace output?
> 

I used blk_rq_deadline() because __blk_add_timer() uses it:

 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
 	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;


This is also the reason why there was no timeout issue during testing:
blk_rq_timeout() will return a reasonable value.

In addition, on a 64-bit system, how do you set up the timer with a 32-bit deadline?


Thanks
Jianchao

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-11 15:26   ` Bart Van Assche
@ 2018-05-14  1:48     ` jianchao.wang
  0 siblings, 0 replies; 10+ messages in thread
From: jianchao.wang @ 2018-05-14  1:48 UTC (permalink / raw)
  To: Bart Van Assche, axboe@kernel.dk
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, hch@lst.de, sebott@linux.ibm.com,
	ming.lei@redhat.com, maxg@mellanox.com, tj@kernel.org

Hi Bart

On 05/11/2018 11:26 PM, Bart Van Assche wrote:
> The bug is in the above trace_printk() call: blk_rq_deadline() must only be used
> for the legacy block layer and not for blk-mq code. If you have a look at the value
> of the das.deadline field then one can see that the value of that field is correct:

blk_rq_deadline() is used by __blk_add_timer(), and __blk_add_timer() is also used by blk_mq_add_timer().
This is why I used blk_rq_deadline() to print out the value in my test.
On a 64-bit system, das.deadline will be a correct value, but what if jiffies has crossed the 32-bit boundary?

> 0xffff550c - 0xffff37c0 = 7500 = 30 * 250. Does that mean that HZ = 250 on the system
> on which you ran this test?
> 

Yes, HZ is 250.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-14  1:37       ` jianchao.wang
@ 2018-05-14  4:03         ` Bart Van Assche
  2018-05-14  5:15           ` jianchao.wang
  0 siblings, 1 reply; 10+ messages in thread
From: Bart Van Assche @ 2018-05-14  4:03 UTC (permalink / raw)
  To: hch@lst.de, jianchao.w.wang@oracle.com
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, sebott@linux.ibm.com, ming.lei@redhat.com,
	axboe@kernel.dk, maxg@mellanox.com, tj@kernel.org

On Mon, 2018-05-14 at 09:37 +0800, jianchao.wang wrote:
> In addition, on a 64-bit system, how do you set up the timer with a 32-bit deadline?

If timeout handling occurs less than (1 << 31) / HZ seconds after a request has been
submitted then a 32-bit variable is sufficient to store the deadline. If you want to
verify the implementation, please have a look at the int32_t variables in
blk_mq_check_expired().

Bart.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-14  4:03         ` Bart Van Assche
@ 2018-05-14  5:15           ` jianchao.wang
  2018-05-14 18:44             ` Bart Van Assche
  0 siblings, 1 reply; 10+ messages in thread
From: jianchao.wang @ 2018-05-14  5:15 UTC (permalink / raw)
  To: Bart Van Assche, hch@lst.de
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, sebott@linux.ibm.com, ming.lei@redhat.com,
	axboe@kernel.dk, maxg@mellanox.com, tj@kernel.org

Hi Bart

On 05/14/2018 12:03 PM, Bart Van Assche wrote:
> On Mon, 2018-05-14 at 09:37 +0800, jianchao.wang wrote:
>> In addition, on a 64-bit system, how do you set up the timer with a 32-bit deadline?
> 
> If timeout handling occurs less than (1 << 31) / HZ seconds after a request has been
> submitted then a 32-bit variable is sufficient to store the deadline. If you want to
> verify the implementation, please have a look at the int32_t variables in
> blk_mq_check_expired().
> 
A 32-bit deadline is indeed OK for judging whether a request has timed out.
But how is the expires value determined for __blk_add_timer() -> mod_timer()?
As we know, when a request is started, we need to arm a timer for it,
and the expires value is 'unsigned long'.

Thanks
Jianchao 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
  2018-05-14  5:15           ` jianchao.wang
@ 2018-05-14 18:44             ` Bart Van Assche
  0 siblings, 0 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-05-14 18:44 UTC (permalink / raw)
  To: hch@lst.de, jianchao.w.wang@oracle.com
  Cc: linux-block@vger.kernel.org, israelr@mellanox.com,
	sagi@grimberg.me, sebott@linux.ibm.com, axboe@kernel.dk,
	ming.lei@redhat.com, maxg@mellanox.com, tj@kernel.org

On Mon, 2018-05-14 at 13:15 +0800, jianchao.wang wrote:
> A 32-bit deadline is indeed OK for judging whether a request has timed out.
> But how is the expires value determined for __blk_add_timer() -> mod_timer()?
> As we know, when a request is started, we need to arm a timer for it,
> and the expires value is 'unsigned long'.

This has been addressed in the just posted version 9. Can you have a look?
Although I had tested timeout handling I think this had not caused my tests
to fail because the following code limits request timeout examination to 5s
in the future:

#define BLK_MAX_TIMEOUT	(5 * HZ)

unsigned long blk_rq_timeout(unsigned long timeout)
{
	unsigned long maxt;

	maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
	if (time_after(timeout, maxt))
		timeout = maxt;

	return timeout;
}

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-05-14 18:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-10 17:56 [PATCH] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-05-11 12:06 ` jianchao.wang
2018-05-11 12:35   ` Christoph Hellwig
2018-05-11 15:29     ` Bart Van Assche
2018-05-14  1:37       ` jianchao.wang
2018-05-14  4:03         ` Bart Van Assche
2018-05-14  5:15           ` jianchao.wang
2018-05-14 18:44             ` Bart Van Assche
2018-05-11 15:26   ` Bart Van Assche
2018-05-14  1:48     ` jianchao.wang
