From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from verein.lst.de (verein.lst.de [213.95.11.211])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id BAFDF1FC7D6
	for <linux-kernel@vger.kernel.org>; Tue, 15 Oct 2024 13:29:33 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=213.95.11.211
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1728998976; cv=none; b=bZCZI/xqnqyscy3H/fp25uLi+l7qYrj1oDz4mx0EPSzBng5KEim9dQI9xdOt0hGidLXiMctkNzvFvF+BRlJvrt6dDPFATmb6mGBTzdp8gtWIZm4bTJ60oF8X3zEboyaZlaQPbhmo6FrER9tWOV0ELgqgsYLDVvN8n7ap53RNblE=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1728998976; c=relaxed/simple;
	bh=7EX4d5MFq3GySGQgWkMHbcmsHKrfVUgSLTVuPf/ljoA=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=MgVeefN2PnHSt9dlT5+SkkTlkDc2IW4YS2cZ7yisxNyLrGtVom1Bka6S2096HuvPDgPZ02ljaQrKQxDBnyZSIyluzyHFA6BqZfKjh88Tmn/7GfS/exqw6cu5IuVOjSOLsO80AAzoDD0pn4M3OHzDowiioemnqEagKLPDMxyUhdk=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=lst.de; spf=pass smtp.mailfrom=lst.de; arc=none smtp.client-ip=213.95.11.211
Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=lst.de
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=lst.de
Received: by verein.lst.de (Postfix, from userid 2407)
	id 19400227AB7; Tue, 15 Oct 2024 15:29:29 +0200 (CEST)
Date: Tue, 15 Oct 2024 15:29:28 +0200
From: Christoph Hellwig <hch@lst.de>
To: Tero Kristo <tero.kristo@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>, linux-kernel@vger.kernel.org,
	axboe@kernel.dk, linux-nvme@lists.infradead.org, sagi@grimberg.me,
	kbusch@kernel.org
Subject: Re: [PATCH 1/1] nvme-pci: Add CPU latency pm-qos handling
Message-ID: <20241015132928.GA3961@lst.de>
References: <20241004101014.3716006-1-tero.kristo@linux.intel.com> <20241004101014.3716006-2-tero.kristo@linux.intel.com> <20241007061926.GA800@lst.de> <913b063d0638614bc95d92969879d2096ffc0722.camel@linux.intel.com> <20241009080052.GA16711@lst.de> <accb9ceb501197b71259d8d3996c461dcef1e7d6.camel@linux.intel.com> <0feb16b0bc3515b0a77f33a3e18568f62236b691.camel@linux.intel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <0feb16b0bc3515b0a77f33a3e18568f62236b691.camel@linux.intel.com>
User-Agent: Mutt/1.5.17 (2007-11-01)

On Tue, Oct 15, 2024 at 12:25:37PM +0300, Tero Kristo wrote:
> I've been giving this some thought offline, but can't really think of
> how this could be done in the generic layers; the code needs to figure
> out the interrupt that gets fired by the activity, to prevent the CPU
> that is going to handle that interrupt from going into deep idle,
> potentially ruining the latency and throughput of the request. The
> knowledge of this interrupt mapping only resides in the driver level,
> in this case NVMe.
> 
> One thing that could be done is to prevent the whole feature from
> being used on setups where the number of cpus per irq is above some
> threshold; let's say 4 as an example.

As a disclaimer, I don't really understand the PM QoS framework, just
the NVMe driver and block layer.

With that said, my gut feeling is that all this latency management
should be driven by the blk_mq_hctx structure, the block layer
equivalent to a queue.  And instead of having a per-cpu array of QoS
requests per device, there should be one per cpu in the actual mask of
the hctx, so that you only have to iterate this local shared data
structure.

Preferably there would be one single active check per hctx and
not one per cpu, e.g. when the block layer submits commands
it has to do one single check instead of an iteration.  Similarly
the block layer code would time out the activity once per hctx,
and only then iterate the (usually few) CPUs per hctx.
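
A rough userspace sketch of that shape, not kernel code: every name
below (hctx_qos, cpu_lat_req, hctx_qos_submit, hctx_qos_timeout) is
made up for illustration, the per-CPU "request" is a plain integer
standing in for a real PM QoS request, and locking and the actual
timer are left out entirely.  It only shows one flag per hctx being
checked on submission, with the CPU list walked on the idle-to-active
transition and again on timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 64
#define NO_LIMIT (-1)

/* Hypothetical stand-in for one per-CPU latency QoS request. */
struct cpu_lat_req {
	int cpu;
	int limit_us;	/* NO_LIMIT while no constraint is held */
};

/* Hypothetical per-hctx state: one flag, one request per CPU in the mask. */
struct hctx_qos {
	bool active;
	size_t nr_cpus;
	struct cpu_lat_req reqs[MAX_CPUS];
	unsigned long cpu_walks;	/* how often the CPU list was iterated */
};

void hctx_qos_init(struct hctx_qos *q, const int *cpus, size_t nr_cpus)
{
	assert(nr_cpus <= MAX_CPUS);
	q->active = false;
	q->nr_cpus = nr_cpus;
	q->cpu_walks = 0;
	for (size_t i = 0; i < nr_cpus; i++) {
		q->reqs[i].cpu = cpus[i];
		q->reqs[i].limit_us = NO_LIMIT;
	}
}

/* Walk the (usually few) CPUs of this hctx and set their limit. */
void hctx_qos_apply(struct hctx_qos *q, int limit_us)
{
	q->cpu_walks++;
	for (size_t i = 0; i < q->nr_cpus; i++)
		q->reqs[i].limit_us = limit_us;
}

/* Submission path: a single flag check, no iteration once active. */
void hctx_qos_submit(struct hctx_qos *q, int limit_us)
{
	if (q->active)
		return;
	q->active = true;
	hctx_qos_apply(q, limit_us);
}

/* Idle timeout: runs once per hctx and only then iterates the CPUs. */
void hctx_qos_timeout(struct hctx_qos *q)
{
	if (!q->active)
		return;
	q->active = false;
	hctx_qos_apply(q, NO_LIMIT);
}
```

With two CPUs in the mask, any number of back-to-back
hctx_qos_submit() calls walk the CPU list once, and a later
hctx_qos_timeout() walks it a second time to drop the limits.  A real
version would also have to decide how submissions re-arm the timeout,
which this sketch does not model.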