* blk-mq: takes hours for scsi scanning finish when thousands of LUNs
@ 2015-10-19 14:40 Zhangqing Luo
2015-10-22 8:47 ` Tejun Heo
0 siblings, 1 reply; 9+ messages in thread
From: Zhangqing Luo @ 2015-10-19 14:40 UTC (permalink / raw)
To: axboe, tj; +Cc: Guru Anbalagane, Feng Jin, linux-kernel
Hi Jens, Tejun,

The problem we are seeing:

When multiqueue (blk-mq) is used for SCSI, the time between successive LUN
probes grows as the number of block request queues goes up; eventually it
takes the initiator hours to finish scanning thousands of LUNs.

Kernel version we're using: 4.1.6-14. I also tested mainline, which has the
same issue.
My analysis:
For each new LUN, when multiqueue is enabled for SCSI, scsi_mq_alloc_queue
is called to allocate and initialize the new queue. The call chain is:
->blk_mq_init_queue
->blk_mq_init_allocated_queue
->blk_mq_add_queue_tag_set
->blk_mq_update_tag_set_depth
blk_mq_update_tag_set_depth calls blk_mq_freeze_queue for every queue
already on the tag set's list. Freezing a single queue takes tens of ticks,
so with thousands of queues on the list it takes more than ten seconds
(HZ=1000) for a new queue to finish its initialization, which is a very
high delay. My question here is: why does blk_mq_update_tag_set_depth
update every queue in the tag list?
If this must be done, then, since the code below does nothing more than set
or clear a flag based on the 'shared' variable, why not store the previous
value of 'shared' and compare it with the current one? If it is unchanged,
nothing needs to be done and we avoid looping over all the queues in the
list. A rough sketch of this idea follows the function below.
static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
{
	struct blk_mq_hw_ctx *hctx;
	struct request_queue *q;
	bool shared;
	int i;

	if (set->tag_list.next == set->tag_list.prev)
		shared = false;
	else
		shared = true;

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		blk_mq_freeze_queue(q);

		queue_for_each_hw_ctx(q, hctx, i) {
			if (shared)
				hctx->flags |= BLK_MQ_F_TAG_SHARED;
			else
				hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
		}
		blk_mq_unfreeze_queue(q);
	}
}
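
Something along these lines would be enough to show the idea (just a rough
sketch, not a tested patch: 'tags_shared' is an assumed new field on
struct blk_mq_tag_set that caches the last computed state, and a newly
added queue would still need to pick up the current shared state itself):

static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
{
	struct blk_mq_hw_ctx *hctx;
	struct request_queue *q;
	bool shared;
	int i;

	shared = set->tag_list.next != set->tag_list.prev;

	/* 'tags_shared' is a hypothetical new field caching the last state */
	if (shared == set->tags_shared)
		return;		/* unchanged: skip freezing every queue */
	set->tags_shared = shared;

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		blk_mq_freeze_queue(q);
		queue_for_each_hw_ctx(q, hctx, i) {
			if (shared)
				hctx->flags |= BLK_MQ_F_TAG_SHARED;
			else
				hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
		}
		blk_mq_unfreeze_queue(q);
	}
}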
Deeper analysis:

I dug further into the delay and found more details. For each already
existing queue, the delay comes from blk_mq_freeze_queue_wait: because the
request queue's mq_usage_counter is not zero, the caller blocks there and
is only woken up by percpu_ref_switch_to_atomic_rcu after an RCU grace
period. Let me explain how this happens.
For each newly allocated queue the percpu ref is created with the
PERCPU_REF_INIT_ATOMIC flag, but when the queue is registered,
blk_mq_finish_init switches it to percpu mode by calling
percpu_ref_switch_to_percpu. So every blk_mq_freeze_queue_start then runs
this way:

blk_mq_freeze_queue_start
->percpu_ref_kill->percpu_ref_kill_and_confirm
->__percpu_ref_switch_to_atomic
->call_rcu_sched(&ref->rcu, percpu_ref_switch_to_atomic_rcu)

and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter because it
is not zero, and is only woken up by percpu_ref_switch_to_atomic_rcu after
a grace period.
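
To make the lifecycle concrete, here is a minimal illustrative sketch (not
the actual kernel code, just the percpu_ref calls involved, with the
block-layer call sites noted in comments; the demo_* names are placeholders):

#include <linux/gfp.h>
#include <linux/percpu-refcount.h>

static void usage_counter_release(struct percpu_ref *ref)
{
	/* last reference dropped: the freeze path gets woken up here */
}

static int demo_usage_counter_lifecycle(struct percpu_ref *ref)
{
	int ret;

	/* queue allocation: the ref starts out in atomic mode */
	ret = percpu_ref_init(ref, usage_counter_release,
			      PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
	if (ret)
		return ret;

	/* blk_mq_finish_init() (from blk_register_queue()): go percpu */
	percpu_ref_switch_to_percpu(ref);

	/*
	 * blk_mq_freeze_queue_start(): killing a percpu-mode ref has to
	 * switch it back to atomic first, and that summation is deferred
	 * behind call_rcu_sched(), so each freeze now costs at least one
	 * RCU grace period.
	 */
	percpu_ref_kill(ref);
	return 0;
}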
My question here is: why do we switch the ref to percpu mode in
blk_mq_finish_init? It is this switch that introduces the delay.
My suggestion:

Don't switch the ref to percpu mode in blk_mq_finish_init; keeping it in
atomic mode solves the problem. Alternatively, optimize
blk_mq_update_tag_set_depth to avoid this delay.

The patch below is one possible fix. I can't tell whether it might
introduce other regressions, so could you look into it and give some
advice?
/Thanks
0001-Avoid-long-delay-for-scsi-scanning-when-thousands-of.patch
From 4505e8d499a7c336ebc3b588b089f5408841b421 Mon Sep 17 00:00:00 2001
From: Jason Luo <zhangqing.luo@oracle.com>
Date: Mon, 19 Oct 2015 14:51:51 +0800
Subject: [PATCH] Avoid long delay for scsi scanning when thousands of LUNs
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>
---
block/blk-mq-sysfs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b79685e..06533a9 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -404,7 +404,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 /* see blk_register_queue() */
 void blk_mq_finish_init(struct request_queue *q)
 {
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+	/* Maybe something else can be done here in the future;
+	 * leave it empty for now. */
+	/* percpu_ref_switch_to_percpu(&q->mq_usage_counter); */
 }
 
 int blk_mq_register_disk(struct gendisk *disk)
--
2.1.4
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-19 14:40 blk-mq: takes hours for scsi scanning finish when thousands of LUNs Zhangqing Luo
@ 2015-10-22 8:47 ` Tejun Heo
2015-10-22 9:15 ` jason
0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2015-10-22 8:47 UTC (permalink / raw)
To: Zhangqing Luo; +Cc: axboe, Guru Anbalagane, Feng Jin, linux-kernel
Hello,
On Mon, Oct 19, 2015 at 07:40:13AM -0700, Zhangqing Luo wrote:
....
> So every time blk_mq_freeze_queue_start, it runs in this way
>
> blk_mq_freeze_queue_start
> ->percpu_ref_kill->percpu_ref_kill_and_confirm
> ->__percpu_ref_switch_to_atomic
> ->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu)
>
> and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter
> as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu
> after a grace period
>
>
> My question here is why should we change ref to PERCPU at blk_mq_finish_init?
> because of this changing, delay appears.
Because percpu operations are way cheaper than atomic ones, and we want
to optimize the hot paths (request issue and completion) over the cold
paths (init and config changes). That's the whole point of percpu
refcounting.
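
For reference, the hot path in question boils down to roughly this (a
simplified sketch, not the exact blk-mq code; the demo_* names are
placeholders standing in for blk_mq_queue_enter()/blk_mq_queue_exit()):

#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/percpu-refcount.h>

static int demo_queue_enter(struct request_queue *q)
{
	/* per-CPU increment on every request issue when in percpu mode */
	if (percpu_ref_tryget_live(&q->mq_usage_counter))
		return 0;
	/* queue is frozen or dying; the real code waits or fails here */
	return -EBUSY;
}

static void demo_queue_exit(struct request_queue *q)
{
	/* matching per-CPU decrement on completion */
	percpu_ref_put(&q->mq_usage_counter);
}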
The reason the percpu ref starts in atomic mode is to avoid an expensive
percpu freeze if the queue is created and abandoned in quick succession,
as SCSI does during LUN scanning. If percpu freezing is happening during
that window, the right solution is moving finish_init to a late enough
point that the percpu switch happens only after it's known that the queue
won't be abandoned.
Thanks.
--
tejun
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 8:47 ` Tejun Heo
@ 2015-10-22 9:15 ` jason
2015-10-22 15:14 ` Jens Axboe
2015-10-23 0:57 ` Ming Lei
0 siblings, 2 replies; 9+ messages in thread
From: jason @ 2015-10-22 9:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: axboe, Guru Anbalagane, Feng Jin, linux-kernel
On Thursday, October 22, 2015 04:47 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Oct 19, 2015 at 07:40:13AM -0700, Zhangqing Luo wrote:
> ....
> > So every time blk_mq_freeze_queue_start, it runs in this way
> >
> > blk_mq_freeze_queue_start
> > ->percpu_ref_kill->percpu_ref_kill_and_confirm
> > ->__percpu_ref_switch_to_atomic
> > ->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu)
> >
> > and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter
> > as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu
> > after a grace period
> >
> >
> > My question here is why should we change ref to PERCPU at blk_mq_finish_init?
> > because of this changing, delay appears.
>
> Because percpu operation is way cheaper than atomic ones and we want
> to optimize hot paths (request issue and completion) over cold paths
> (init and config changes). That's the whole point of percpu
> refcnting.
>
> The reason why percpu ref starts in atomic mode is to avoid expensive
> percpu freezing if the queue is created and abandoned in quick
> succession as SCSI does during LUN scanning. If percpu freezing is
> happening during that, the right solution is moving finish_init to
> late enough point so that percpu switching happens only after it's
> known that the queue won't be abandoned.
>
> Thanks.
>
I agree with optimizing the hot paths via cheaper percpu operations, but
how much does it actually affect performance? As you know, the switch
causes a delay, and as the number of LUNs increases the delay keeps
growing. Do you have any ideas about this problem?
/Thanks
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 9:15 ` jason
@ 2015-10-22 15:14 ` Jens Axboe
2015-10-22 15:53 ` Jeff Moyer
2015-10-23 0:57 ` Ming Lei
1 sibling, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2015-10-22 15:14 UTC (permalink / raw)
To: jason, Tejun Heo; +Cc: Guru Anbalagane, Feng Jin, linux-kernel
On 10/22/2015 03:15 AM, jason wrote:
>
>
> On Thursday, October 22, 2015 04:47 PM, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Oct 19, 2015 at 07:40:13AM -0700, Zhangqing Luo wrote:
>> ....
>> > So every time blk_mq_freeze_queue_start, it runs in this way
>> >
>> > blk_mq_freeze_queue_start
>> > ->percpu_ref_kill->percpu_ref_kill_and_confirm
>> > ->__percpu_ref_switch_to_atomic
>> > ->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu)
>> >
>> > and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter
>> > as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu
>> > after a grace period
>> >
>> >
>> > My question here is why should we change ref to PERCPU at
>> blk_mq_finish_init?
>> > because of this changing, delay appears.
>>
>> Because percpu operation is way cheaper than atomic ones and we want
>> to optimize hot paths (request issue and completion) over cold paths
>> (init and config changes). That's the whole point of percpu
>> refcnting.
>>
>> The reason why percpu ref starts in atomic mode is to avoid expensive
>> percpu freezing if the queue is created and abandoned in quick
>> succession as SCSI does during LUN scanning. If percpu freezing is
>> happening during that, the right solution is moving finish_init to
>> late enough point so that percpu switching happens only after it's
>> known that the queue won't be abandoned.
>>
>> Thanks.
>>
> I agree with the optimizing hot paths by cheaper percpu operation,
> but how much does it affect the performance?
A lot, since the queue referencing happens twice per IO. The switch to
percpu refs was done to use shared/common code for this; the previous
version was a handrolled variant of the same thing.
> as you know the switching causes delay, when the the LUN number is
> increasing
> the delay is becoming higher, so do you have any idea
> about the problem?
Tejun already outlined a good solution to the problem:
"If percpu freezing is
happening during that, the right solution is moving finish_init to
late enough point so that percpu switching happens only after it's
known that the queue won't be abandoned."
It'd be great if you could look into that. Your original patch
demonstrates exactly where the problem is, but of course it's not
something that can be applied.
--
Jens Axboe
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 15:14 ` Jens Axboe
@ 2015-10-22 15:53 ` Jeff Moyer
2015-10-22 16:06 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Jeff Moyer @ 2015-10-22 15:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: jason, Tejun Heo, Guru Anbalagane, Feng Jin, linux-kernel
Jens Axboe <axboe@kernel.dk> writes:
>> I agree with the optimizing hot paths by cheaper percpu operation,
>> but how much does it affect the performance?
>
> A lot, since the queue referencing happens twice per IO. The switch to
> percpu was done to use shared/common code for this, the previous
> version was a handrolled version of that.
>
>> as you know the switching causes delay, when the the LUN number is
>> increasing
>> the delay is becoming higher, so do you have any idea
>> about the problem?
>
> Tejun already outlined a good solution to the problem:
>
> "If percpu freezing is
> happening during that, the right solution is moving finish_init to
> late enough point so that percpu switching happens only after it's
> known that the queue won't be abandoned."
I'm sure I'm missing something, but I don't think that will work.
blk_mq_update_tag_set_depth is freezing every single queue. Those queues
are already set up and will not go away. So how will moving finish_init
later in the queue setup fix this? The patch Jason provided most likely
works because __percpu_ref_switch_to_atomic then doesn't do anything. The
most important things it skips are:
1) percpu_ref_get(mq_usage_counter), followed by
2) call_rcu_sched()

It seems likely to me that forcing an RCU grace period for every single
LUN attached to a particular host is what's causing the delay.
And now you'll tell me how I've got that all wrong. ;-)
Anyway, I think what Jason initially suggested would work:
"if this thing must be done, as the code below shows just changing
flags depending on 'shared' variable why shouldn't we store the
previous result of 'shared' and compare with the current result, if
it's unchanged, nothing will be done and avoid looping all queues in
list."
I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
allow newly created hctxs to simply inherit the shared state (in
blk_mq_init_hctx), and you won't need to freeze every queue in order to
guarantee that.
I was writing a patch to that effect. I've now stopped as I want to
make sure I'm not off in the weeds. :)
Cheers,
Jeff
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 15:53 ` Jeff Moyer
@ 2015-10-22 16:06 ` Jens Axboe
2015-10-22 19:04 ` Jeff Moyer
0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2015-10-22 16:06 UTC (permalink / raw)
To: Jeff Moyer; +Cc: jason, Tejun Heo, Guru Anbalagane, Feng Jin, linux-kernel
On 10/22/2015 09:53 AM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>>> I agree with the optimizing hot paths by cheaper percpu operation,
>>> but how much does it affect the performance?
>>
>> A lot, since the queue referencing happens twice per IO. The switch to
>> percpu was done to use shared/common code for this, the previous
>> version was a handrolled version of that.
>>
>>> as you know the switching causes delay, when the the LUN number is
>>> increasing
>>> the delay is becoming higher, so do you have any idea
>>> about the problem?
>>
>> Tejun already outlined a good solution to the problem:
>>
>> "If percpu freezing is
>> happening during that, the right solution is moving finish_init to
>> late enough point so that percpu switching happens only after it's
>> known that the queue won't be abandoned."
>
> I'm sure I'm missing something, but I don't think that will work.
> blk_mq_update_tag_depth is freezing every single queue. Those queues
> are already setup and will not go away. So how will moving finish_init
> later in the queue setup fix this? The patch Jason provided most likely
> works because __percpu_ref_switch_to_atomic doesn't do anything. The
> most important things it doesn't do are:
> 1) percpu_ref_get(mq_usage_counter), followed by
> 2) call_rcu_sched()
>
> It seems likely to me that forcing an rcu grace period for every single
> LUN attached to a particular host is what's causing the delay.
>
> And now you'll tell me how I've got that all wrong. ;-)
Haha, no, I think that is absolutely right. We've seen this kind of bug a
lot: thousands of serialized RCU grace period waits. This is just one
more. The patch that Jason sent simply bypassed the percpu switch, which
we can't do.
> Anyway, I think what Jason had initially suggested, would work:
>
> "if this thing must be done, as the code below shows just changing
> flags depending on 'shared' variable why shouldn't we store the
> previous result of 'shared' and compare with the current result, if
> it's unchanged, nothing will be done and avoid looping all queues in
> list."
>
> I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
> allow newly created hctxs to simply inherit the shared state (in
> blk_mq_init_hctx), and you won't need to freeze every queue in order to
> guarantee that.
>
> I was writing a patch to that effect. I've now stopped as I want to
> make sure I'm not off in the weeds. :)
If that is where the delay comes from, then yes, that should fix it and
be a trivial patch.
--
Jens Axboe
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 16:06 ` Jens Axboe
@ 2015-10-22 19:04 ` Jeff Moyer
2015-10-23 9:48 ` jason
0 siblings, 1 reply; 9+ messages in thread
From: Jeff Moyer @ 2015-10-22 19:04 UTC (permalink / raw)
To: Jens Axboe; +Cc: jason, Tejun Heo, Guru Anbalagane, Feng Jin, linux-kernel
Jens Axboe <axboe@kernel.dk> writes:
> On 10/22/2015 09:53 AM, Jeff Moyer wrote:
>> I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
>> allow newly created hctxs to simply inherit the shared state (in
>> blk_mq_init_hctx), and you won't need to freeze every queue in order to
>> guarantee that.
>>
>> I was writing a patch to that effect. I've now stopped as I want to
>> make sure I'm not off in the weeds. :)
>
> If that is where the delay is done, then yes, that should fix it and
> be a trivial patch.
It's not quite as trivial as I had hoped. Jason, can you give the
attached patch a try? All I've done so far is boot-test it.
Thanks!
Jeff
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7785ae9..8b4c484 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1860,27 +1860,26 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
+static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct request_queue *q;
-	bool shared;
 	int i;
 
-	if (set->tag_list.next == set->tag_list.prev)
-		shared = false;
-	else
-		shared = true;
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (shared)
+			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+		else
+			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+	}
+}
+
+static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
+{
+	struct request_queue *q;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
-
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (shared)
-				hctx->flags |= BLK_MQ_F_TAG_SHARED;
-			else
-				hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
-		}
+		queue_set_hctx_shared(q, shared);
 		blk_mq_unfreeze_queue(q);
 	}
 }
@@ -1891,7 +1890,13 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 
 	mutex_lock(&set->tag_list_lock);
 	list_del_init(&q->tag_set_list);
-	blk_mq_update_tag_set_depth(set);
+
+	if (set->tag_list.next == set->tag_list.prev) {
+		/* just transitioned to unshared */
+		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		/* update existing queue */
+		blk_mq_update_tag_set_depth(set, false);
+	}
 	mutex_unlock(&set->tag_list_lock);
 }
 
@@ -1902,7 +1907,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 
 	mutex_lock(&set->tag_list_lock);
 	list_add_tail(&q->tag_set_list, &set->tag_list);
-	blk_mq_update_tag_set_depth(set);
+
+	if (set->tag_list.next != set->tag_list.prev) {
+		/*
+		 * Only update the tag set state if the state has
+		 * actually changed.
+		 */
+		if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
+			/* just transitioned to shared tags */
+			set->flags |= BLK_MQ_F_TAG_SHARED;
+			blk_mq_update_tag_set_depth(set, true);
+		} else {
+			/* ensure we didn't race with another addition */
+			struct blk_mq_hw_ctx *hctx = queue_first_hw_ctx(q);
+			if ((hctx->flags & BLK_MQ_F_TAG_SHARED) !=
+					BLK_MQ_F_TAG_SHARED)
+				queue_set_hctx_shared(q, true);
+		}
+	}
 	mutex_unlock(&set->tag_list_lock);
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5e7d43a..12ffc40 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -254,6 +254,9 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
 	for ((i) = 0; (i) < (hctx)->nr_ctx &&			\
 	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
 
+#define queue_first_hw_ctx(q)					\
+	(q)->queue_hw_ctx[0]
+
 #define blk_ctx_sum(q, sum)					\
 ({								\
 	struct blk_mq_ctx *__x;					\
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 9:15 ` jason
2015-10-22 15:14 ` Jens Axboe
@ 2015-10-23 0:57 ` Ming Lei
1 sibling, 0 replies; 9+ messages in thread
From: Ming Lei @ 2015-10-23 0:57 UTC (permalink / raw)
To: jason
Cc: Tejun Heo, Jens Axboe, Guru Anbalagane, Feng Jin,
Linux Kernel Mailing List
On Thu, Oct 22, 2015 at 5:15 PM, jason <zhangqing.luo@oracle.com> wrote:
>
>
> On Thursday, October 22, 2015 04:47 PM, Tejun Heo wrote:
>>
>> Hello,
>>
>> On Mon, Oct 19, 2015 at 07:40:13AM -0700, Zhangqing Luo wrote:
>> ....
>> > So every time blk_mq_freeze_queue_start, it runs in this way
>> >
>> > blk_mq_freeze_queue_start
>> > ->percpu_ref_kill->percpu_ref_kill_and_confirm
>> > ->__percpu_ref_switch_to_atomic
>> > ->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu)
>> >
>> > and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter
>> > as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu
>> > after a grace period
>> >
>> >
>> > My question here is why should we change ref to PERCPU at
>> > blk_mq_finish_init?
>> > because of this changing, delay appears.
>>
>> Because percpu operation is way cheaper than atomic ones and we want
>> to optimize hot paths (request issue and completion) over cold paths
>> (init and config changes). That's the whole point of percpu
>> refcnting.
>>
>> The reason why percpu ref starts in atomic mode is to avoid expensive
>> percpu freezing if the queue is created and abandoned in quick
>> succession as SCSI does during LUN scanning. If percpu freezing is
>> happening during that, the right solution is moving finish_init to
>> late enough point so that percpu switching happens only after it's
>> known that the queue won't be abandoned.
>>
>> Thanks.
>>
> I agree with the optimizing hot paths by cheaper percpu operation,
> but how much does it affect the performance?
>
> as you know the switching causes delay, when the the LUN number is
> increasing
> the delay is becoming higher, so do you have any idea
> about the problem?
For a normal SCSI disk, add_disk() is called in the sd probe path, and by
that time the request queue should have been initialized completely; that
is, blk_mq_add_queue_tag_set() should run while the usage ref counter is
still in atomic mode.

I'm not sure how you can observe the ref counter running in percpu mode
during queue initialization.

Care to share what your disk/driver is? I suspect this might be an issue
in one SCSI LLD.
Thanks,
>
> /Thanks
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Ming Lei
* Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
2015-10-22 19:04 ` Jeff Moyer
@ 2015-10-23 9:48 ` jason
0 siblings, 0 replies; 9+ messages in thread
From: jason @ 2015-10-23 9:48 UTC (permalink / raw)
To: Jeff Moyer, Jens Axboe; +Cc: Tejun Heo, Guru Anbalagane, Feng Jin, linux-kernel
Hi Jeff,
On Friday, October 23, 2015 03:04 AM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 10/22/2015 09:53 AM, Jeff Moyer wrote:
>>> I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
>>> allow newly created hctxs to simply inherit the shared state (in
>>> blk_mq_init_hctx), and you won't need to freeze every queue in order to
>>> guarantee that.
>>>
>>> I was writing a patch to that effect. I've now stopped as I want to
>>> make sure I'm not off in the weeds. :)
>>
>> If that is where the delay is done, then yes, that should fix it and
>> be a trivial patch.
>
> It's not quite as trivial as I had hoped. Jason, can you give the
> attached patch a try? All I've done is boot tested it so far.
>
> Thanks!
> Jeff
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7785ae9..8b4c484 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1860,27 +1860,26 @@ static void blk_mq_map_swqueue(struct request_queue *q,
> }
> }
>
> -static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
> +static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> {
> struct blk_mq_hw_ctx *hctx;
> - struct request_queue *q;
> - bool shared;
> int i;
>
> - if (set->tag_list.next == set->tag_list.prev)
> - shared = false;
> - else
> - shared = true;
> + queue_for_each_hw_ctx(q, hctx, i) {
> + if (shared)
> + hctx->flags |= BLK_MQ_F_TAG_SHARED;
> + else
> + hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
> + }
> +}
> +
> +static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
> +{
> + struct request_queue *q;
>
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_freeze_queue(q);
> -
> - queue_for_each_hw_ctx(q, hctx, i) {
> - if (shared)
> - hctx->flags |= BLK_MQ_F_TAG_SHARED;
> - else
> - hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
> - }
> + queue_set_hctx_shared(q, shared);
> blk_mq_unfreeze_queue(q);
> }
> }
> @@ -1891,7 +1890,13 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
>
> mutex_lock(&set->tag_list_lock);
> list_del_init(&q->tag_set_list);
> - blk_mq_update_tag_set_depth(set);
> +
> + if (set->tag_list.next == set->tag_list.prev) {
> + /* just transitioned to unshared */
> + set->flags &= ~BLK_MQ_F_TAG_SHARED;
> + /* update existing queue */
> + blk_mq_update_tag_set_depth(set, false);
> + }
> mutex_unlock(&set->tag_list_lock);
> }
>
> @@ -1902,7 +1907,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>
> mutex_lock(&set->tag_list_lock);
> list_add_tail(&q->tag_set_list, &set->tag_list);
> - blk_mq_update_tag_set_depth(set);
> +
> + if (set->tag_list.next != set->tag_list.prev) {
> + /*
> + * Only update the tag set state if the state has
> + * actually changed.
> + */
> + if (!(set->flags & BLK_MQ_F_TAG_SHARED)) {
> + /* just transitioned to shared tags */
> + set->flags |= BLK_MQ_F_TAG_SHARED;
> + blk_mq_update_tag_set_depth(set, true);
> + } else {
> + /* ensure we didn't race with another addition */
> + struct blk_mq_hw_ctx *hctx = queue_first_hw_ctx(q);
> + if ((hctx->flags & BLK_MQ_F_TAG_SHARED) !=
> + BLK_MQ_F_TAG_SHARED)
> + queue_set_hctx_shared(q, true);
> + }
> + }
> mutex_unlock(&set->tag_list_lock);
> }
>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 5e7d43a..12ffc40 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -254,6 +254,9 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
> for ((i) = 0; (i) < (hctx)->nr_ctx && \
> ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
>
> +#define queue_first_hw_ctx(q) \
> + (q)->queue_hw_ctx[0]
> +
> #define blk_ctx_sum(q, sum) \
> ({ \
> struct blk_mq_ctx *__x; \
>
I tested your patch, which avoids looping over all the queues in the set
by checking the BLK_MQ_F_TAG_SHARED flag.

It works!