From: keith.busch@intel.com (Keith Busch)
To: Bjorn Helgaas
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei, linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org
Date: Fri, 4 Jan 2019 15:56:22 -0700
Subject: [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry
In-Reply-To: <20190104223531.GA223506@google.com>
References: <20190103225033.11249-1-keith.busch@intel.com> <20190103225033.11249-4-keith.busch@intel.com> <20190104223531.GA223506@google.com>
Message-ID: <20190104225621.GA12916@localhost.localdomain>

On Fri, Jan 04, 2019 at 04:35:31PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> > +The 'struct irq_affinity *affd' allows a driver to specify additional
> > +characteristics for how a driver wants the vector management to occur. The
> > +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> > +wants to not participate in kernel managed affinities, and whether those
> > +special vectors are at the beginning or the end of the vector space.
>
> How are the pre_vectors and post_vectors handled?  Do they get
> assigned to random CPUs?  Current CPU?  Are their assignments tunable
> from user space?

Point taken. Those do get assigned a default mask, but they are also
tunable from user space, and the kernel migrates them when CPUs go
offline or come online.

> > +It may also be the case that a driver wants multiple sets of fully
> > +affinitized vectors. For example, a single PCI function may provide
> > +different high performance services that want full CPU affinity for each
> > +service independent of other services. In this case, the driver may use
> > +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> > +vectors need to be spread across all the CPUs, and fill in the 'sets'
> > +array to say how many vectors the driver wants in each set.
>
> I think the issue here is IRQ vectors, and "services" and whether
> they're high performance are unnecessary concepts.

It's really intended for devices whose resources are optimally accessed
per CPU. I can rephrase this description to make that clearer.
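The quoted documentation distinguishes pre/post vectors, which only get a default (user-tunable) mask, from the vectors in each set, which are spread across all CPUs independently of the other sets. A minimal userspace model of that split follows. It is not kernel code: `struct affd_model` and `cpu_range_for_vector` are hypothetical names, and the even, NUMA-unaware division of CPUs is a simplifying assumption (the kernel's real affinity spreading is topology aware).

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the fields described in the patch text.
 * This is NOT the kernel's struct irq_affinity.
 */
struct affd_model {
	int pre_vectors;   /* leading vectors excluded from spreading */
	int post_vectors;  /* trailing vectors excluded from spreading */
	int nr_sets;       /* number of independently spread sets */
	const int *sets;   /* sets[i] = number of vectors in set i */
};

/*
 * For vector 'vec' (0-based across all vectors), store the CPU range
 * [*first, *last] it is spread over and return 1. Return 0 for a pre or
 * post vector, which only gets a default mask. Each set is split evenly
 * over all 'ncpus' CPUs on its own (simplified model, ignores NUMA).
 */
static int cpu_range_for_vector(const struct affd_model *a, int ncpus,
				int vec, int *first, int *last)
{
	int base = a->pre_vectors;

	if (vec < base)
		return 0;                       /* a pre_vector */
	for (int s = 0; s < a->nr_sets; s++) {
		int n = a->sets[s];

		/* model assumption: each set has 1..ncpus vectors */
		assert(n >= 1 && n <= ncpus);
		if (vec < base + n) {
			int j = vec - base;     /* index within this set */

			*first = j * ncpus / n;
			*last = (j + 1) * ncpus / n - 1;
			return 1;
		}
		base += n;                      /* next set starts where this one ends */
	}
	return 0;                               /* a post_vector */
}
```

With pre_vectors = 1, sets = {4, 2}, post_vectors = 1 and 8 CPUs, vector 0 is not spread; vectors 1-4 (set 0) cover CPUs 0-1, 2-3, 4-5 and 6-7; vectors 5-6 (set 1) independently cover CPUs 0-3 and 4-7; vector 7 is not spread.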
> What does irq_affinity.sets point to?  I guess it's a table of
> integers where the table size is the number of sets and each entry is
> the number of vectors in the set?
>
> So we'd have something like this:
>
>   pre_vectors   # vectors [0..pre_vectors) (pre_vectors >= 0)
>   set 0         # vectors [pre_vectors..pre_vectors+set0) (set0 >= 1)
>   set 1         # vectors [pre_vectors+set0..pre_vectors+set0+set1) (set1 >= 1)
>   ...
>   post_vectors  # vectors [pre_vectors+set0+...+setN..pre_vectors+set0+...+setN+post_vectors)
>
> where the vectors in set0 are spread across all CPUs, those in set1
> are independently spread across all CPUs, etc?
>
> I would guess there may be device-specific restrictions on the mapping
> of these vectors to sets, so the PCI core probably can't assume the
> sets can be of arbitrary size, contiguous, etc.

I think it's fair to say the caller wants the vectors allocated, and each
set affinitized, contiguously, such that each set starts immediately
after the previous one ends. That works well for how NVMe wants to use
it, at least. If a device driver really needs some other arrangement, I
can't see how that could be easily accommodated.
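The contiguous arrangement described above can be sketched as a small userspace C helper that computes each set's half-open vector range, assuming the sets follow the pre_vectors back to back and the post_vectors follow the last set. The name `layout_sets` is hypothetical; this is an illustration of the layout, not a kernel function.

```c
#include <assert.h>

/*
 * Hypothetical helper: compute the half-open vector range [start[i], end[i])
 * of each set, laying the sets out back to back after the pre_vectors.
 * Returns the total vector count, including pre_vectors and post_vectors.
 */
static int layout_sets(int pre_vectors, const int *sets, int nr_sets,
		       int post_vectors, int *start, int *end)
{
	int next = pre_vectors;         /* set 0 begins right after the pre_vectors */

	assert(pre_vectors >= 0 && post_vectors >= 0);
	for (int i = 0; i < nr_sets; i++) {
		assert(sets[i] >= 1);   /* every set holds at least one vector */
		start[i] = next;
		end[i] = next + sets[i];
		next = end[i];          /* the next set starts where this one ends */
	}
	/* post_vectors occupy [next, next + post_vectors) */
	return next + post_vectors;
}
```

For pre_vectors = 1, sets = {4, 2} and post_vectors = 1, set 0 occupies [1, 5), set 1 occupies [5, 7), the post_vector occupies [7, 8), and the helper returns 8. The post_vectors range therefore begins at pre_vectors + set0 + ... + setN, after the last set.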