From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S965698AbcIPUxv (ORCPT );
	Fri, 16 Sep 2016 16:53:51 -0400
Received: from mga11.intel.com ([192.55.52.93]:44858 "EHLO mga11.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S965254AbcIPUxl (ORCPT );
	Fri, 16 Sep 2016 16:53:41 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.30,346,1470726000"; d="scan'208";a="169673348"
Date: Fri, 16 Sep 2016 17:04:48 -0400
From: Keith Busch
To: Alexander Gordeev
Cc: linux-kernel@vger.kernel.org, Jens Axboe, linux-nvme@lists.infradead.org
Subject: Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
Message-ID: <20160916210448.GA1178@localhost.localdomain>
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
User-Agent: Mutt/1.7.0 (2016-08-17)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 16, 2016 at 10:51:11AM +0200, Alexander Gordeev wrote:
> Linux block device layer limits number of hardware contexts queues
> to number of CPUs in the system. That looks like suboptimal hardware
> utilization in systems where number of CPUs is (significantly) less
> than number of hardware queues.
> 
> In addition, there is a need to deal with tag starvation (see commit
> 0d2602ca "blk-mq: improve support for shared tags maps"). While unused
> hardware queues stay idle, extra efforts are taken to maintain a notion
> of fairness between queue users. Deeper queue depth could probably
> mitigate the whole issue sometimes.
> 
> That all brings a straightforward idea that hardware queues provided by
> a device should be utilized as much as possible.

Hi Alex,

I'm not sure I see how this helps. That probably means I'm not
considering the right scenario. Could you elaborate on when having
multiple hardware queues for a given CPU to choose from would provide
a benefit?
If we're out of available h/w tags, having more queues shouldn't improve
performance. The tag depth on each nvme hw context is already deep
enough that even one full queue should saturate the device's
capabilities. A 1:1 mapping already seemed like the ideal solution,
since you can't simultaneously utilize more contexts than that from the
host, so there's no more h/w parallelism we can exploit. On the
controller side, fetching commands is serialized memory reads, so I
don't think spreading IO among more h/w queues helps the target over
posting more commands to a single queue.

If a CPU has more than one queue to choose from, a command sent to a
less used queue would be serviced ahead of previously issued commands
on a more heavily used one from the same CPU thread, due to how NVMe
command arbitration works, so it sounds like this would create odd
latency outliers.

Thanks,
Keith
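
P.S. To illustrate the latency-outlier concern, here is a toy model
(plain Python, not kernel or NVMe code; queue contents and the
one-command-per-queue fetch granularity are illustrative assumptions)
of round-robin arbitration across two submission queues:

```python
# Toy model of NVMe round-robin arbitration: the controller fetches
# one command per non-empty queue per arbitration turn. Not real
# driver code; queue contents below are made up for illustration.
from collections import deque

def rr_service_order(queues):
    """Return the order commands are fetched under round-robin."""
    qs = [deque(q) for q in queues]
    order = []
    while any(qs):
        for q in qs:
            if q:
                order.append(q.popleft())
    return order

# One CPU thread issues cmds 1-4 on a busy queue, then cmd 5 on a
# less used queue it also maps to.
print(rr_service_order([[1, 2, 3, 4], [5]]))
# -> [1, 5, 2, 3, 4]: cmd 5 is fetched ahead of cmds 2-4, which the
# same thread issued before it.
```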