Subject: Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
From: Jens Axboe
To: Jeff Moyer
Cc: jason, Tejun Heo, Guru Anbalagane, Feng Jin, linux-kernel@vger.kernel.org
Date: Thu, 22 Oct 2015 10:06:09 -0600
Message-ID: <56290971.9060403@kernel.dk>
In-Reply-To: <20151022084733.GA24379@mtj.duckdns.org>

On 10/22/2015 09:53 AM, Jeff Moyer wrote:
> Jens Axboe writes:
>
>>> I agree with optimizing the hot paths via the cheaper percpu
>>> operation, but how much does it affect performance?
>>
>> A lot, since the queue referencing happens twice per IO. The switch to
>> percpu was done to use shared/common code for this; the previous
>> version was a hand-rolled version of that.
>>
>>> As you know, the switching causes delay, and as the LUN count
>>> increases the delay grows, so do you have any idea about the
>>> problem?
>>
>> Tejun already outlined a good solution to the problem:
>>
>> "If percpu freezing is
>> happening during that, the right solution is moving finish_init to
>> late enough point so that percpu switching happens only after it's
>> known that the queue won't be abandoned."
>
> I'm sure I'm missing something, but I don't think that will work.
> blk_mq_update_tag_depth is freezing every single queue. Those queues
> are already set up and will not go away. So how will moving finish_init
> later in the queue setup fix this? The patch Jason provided most likely
> works because __percpu_ref_switch_to_atomic doesn't do anything. The
> most important things it doesn't do are:
> 1) percpu_ref_get(mq_usage_counter), followed by
> 2) call_rcu_sched()
>
> It seems likely to me that forcing an rcu grace period for every single
> LUN attached to a particular host is what's causing the delay.
>
> And now you'll tell me how I've got that all wrong. ;-)

Haha, no, I think that is absolutely right. We've seen these bugs a lot:
thousands of serialized rcu grace period waits, and this is just one
more. The patch that Jason sent simply bypassed the percpu switch, which
we can't do.

> Anyway, I think what Jason had initially suggested would work:
>
> "if this thing must be done, as the code below shows just changing
> flags depending on 'shared' variable why shouldn't we store the
> previous result of 'shared' and compare with the current result, if
> it's unchanged, nothing will be done and avoid looping all queues in
> list."
>
> I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
> allow newly created hctxs to simply inherit the shared state (in
> blk_mq_init_hctx), and you won't need to freeze every queue in order to
> guarantee that.
>
> I was writing a patch to that effect. I've now stopped as I want to
> make sure I'm not off in the weeds. :)

If that is where the delay comes from, then yes, that should fix it and
be a trivial patch.

-- 
Jens Axboe