Subject: Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs
From: Jens Axboe
To: Jeff Moyer
Cc: jason, Tejun Heo, Guru Anbalagane, Feng Jin, linux-kernel@vger.kernel.org
Date: Thu, 22 Oct 2015 10:06:09 -0600
Message-ID: <56290971.9060403@kernel.dk>
In-Reply-To: <20151022084733.GA24379@mtj.duckdns.org>

On 10/22/2015 09:53 AM, Jeff Moyer wrote:
> Jens Axboe writes:
>
>>> I agree with optimizing the hot paths via the cheaper percpu
>>> operation, but how much does it affect performance?
>>
>> A lot, since the queue referencing happens twice per IO. The switch to
>> percpu was done to use shared/common code for this; the previous
>> version was a hand-rolled version of that.
>>
>>> As you know, the switching causes delay, and as the LUN count
>>> increases the delay grows, so do you have any idea about the
>>> problem?
>>
>> Tejun already outlined a good solution to the problem:
>>
>> "If percpu freezing is
>> happening during that, the right solution is moving finish_init to
>> late enough point so that percpu switching happens only after it's
>> known that the queue won't be abandoned."
>
> I'm sure I'm missing something, but I don't think that will work.
> blk_mq_update_tag_depth is freezing every single queue. Those queues
> are already set up and will not go away. So how will moving finish_init
> later in the queue setup fix this? The patch Jason provided most likely
> works because __percpu_ref_switch_to_atomic doesn't do anything. The
> most important things it doesn't do are:
> 1) percpu_ref_get(mq_usage_counter), followed by
> 2) call_rcu_sched()
>
> It seems likely to me that forcing an rcu grace period for every single
> LUN attached to a particular host is what's causing the delay.
>
> And now you'll tell me how I've got that all wrong. ;-)

Haha, no, I think that is absolutely right. We've seen these bugs a lot:
thousands of serialized rcu grace period waits, and this is just one
more. The patch that Jason sent simply bypassed the percpu switch, which
we can't do.

> Anyway, I think what Jason had initially suggested would work:
>
> "if this thing must be done, as the code below shows just changing
> flags depending on 'shared' variable why shouldn't we store the
> previous result of 'shared' and compare with the current result, if
> it's unchanged, nothing will be done and avoid looping all queues in
> list."
>
> I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
> allow newly created hctxs to simply inherit the shared state (in
> blk_mq_init_hctx), and you won't need to freeze every queue in order to
> guarantee that.
>
> I was writing a patch to that effect. I've now stopped as I want to
> make sure I'm not off in the weeds. :)

If that is where the delay comes from, then yes, that should fix it and
be a trivial patch.

-- 
Jens Axboe