From mboxrd@z Thu Jan  1 00:00:00 1970
From: keith.busch@intel.com (Keith Busch)
Date: Fri, 4 Jan 2019 15:56:22 -0700
Subject: [PATCHv2 3/4] PCI/MSI: Handle vector reduce and retry
In-Reply-To: <20190104223531.GA223506@google.com>
References: <20190103225033.11249-1-keith.busch@intel.com>
 <20190103225033.11249-4-keith.busch@intel.com>
 <20190104223531.GA223506@google.com>
Message-ID: <20190104225621.GA12916@localhost.localdomain>

On Fri, Jan 04, 2019 at 04:35:31PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 03, 2019 at 03:50:32PM -0700, Keith Busch wrote:
> > +The 'struct irq_affinity *affd' allows a driver to specify additional
> > +characteristics for how a driver wants the vector management to occur. The
> > +'pre_vectors' and 'post_vectors' fields define how many vectors the driver
> > +wants to not participate in kernel managed affinities, and whether those
> > +special vectors are at the beginning or the end of the vector space.
> 
> How are the pre_vectors and post_vectors handled?  Do they get
> assigned to random CPUs?  Current CPU?  Are their assignments tunable
> from user space?

Point taken. Those vectors do get assigned a default mask, but they are
also tunable from user space, and the kernel can migrate them when CPUs
go offline or come online.
 
> > +It may also be the case that a driver wants multiple sets of fully
> > +affinitized vectors. For example, a single PCI function may provide
> > +different high performance services that want full CPU affinity for each
> > +service independent of other services. In this case, the driver may use
> > +the struct irq_affinity's 'nr_sets' field to specify how many groups of
> > +vectors need to be spread across all the CPUs, and fill in the 'sets'
> > +array to say how many vectors the driver wants in each set.
> 
> I think the issue here is IRQ vectors, and "services" and whether
> they're high performance are unnecessary concepts.

It's really intended for when your device has resources optimally accessed
in a per-cpu manner. I can better rephrase this description.

> What does irq_affinity.sets point to?  I guess it's a table of
> integers where the table size is the number of sets and each entry is
> the number of vectors in the set?
>
> So we'd have something like this:
> 
>   pre_vectors     # vectors [0..pre_vectors) (pre_vectors >= 0)
>   set 0           # vectors [pre_vectors..pre_vectors+set0) (set0 >= 1)
>   set 1           # vectors [pre_vectors+set0..pre_vectors+set0+set1) (set1 >= 1)
>   ...
>   post_vectors    # vectors [pre_vectors+set0+...+setN..pre_vectors+set0+...+setN+post_vectors)
> 
> where the vectors in set0 are spread across all CPUs, those in set1
> are independently spread across all CPUs, etc?
>
> I would guess there may be device-specific restrictions on the mapping
> of these vectors to sets, so the PCI core probably can't assume the
> sets can be of arbitrary size, contiguous, etc.

I think it's fair to say the caller wants vectors allocated and each set
affinitized contiguously such that each set starts after the previous
one ends. That works great with how NVMe wants to use it, at least. If
there is really any other way a device driver wants it, I can't see how
that can be easily accommodated.
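
To make the contiguous layout concrete, here is a minimal userspace
sketch of how the per-set vector ranges could be derived. It is not the
kernel implementation: 'struct affd_sketch', 'struct vec_range',
'layout_sets' and 'MAX_SETS' are hypothetical names for illustration,
and only the field names (pre_vectors, post_vectors, nr_sets, sets) come
from the discussion above.

```c
#include <assert.h>

#define MAX_SETS 4	/* hypothetical limit, for this sketch only */

/* Hypothetical userspace mirror of the fields discussed in this thread. */
struct affd_sketch {
	unsigned int pre_vectors;	/* unmanaged vectors at the start */
	unsigned int post_vectors;	/* unmanaged vectors at the end */
	unsigned int nr_sets;		/* number of affinitized sets */
	unsigned int sets[MAX_SETS];	/* vector count of each set */
};

/* Half-open vector range [first, end) occupied by one set. */
struct vec_range {
	unsigned int first;
	unsigned int end;
};

/*
 * Place each set directly after the previous one, starting right after
 * the pre_vectors. Fills ranges[0..nr_sets) and returns the total number
 * of vectors, including pre_vectors and post_vectors.
 */
static unsigned int layout_sets(const struct affd_sketch *affd,
				struct vec_range *ranges)
{
	unsigned int next = affd->pre_vectors;
	unsigned int i;

	assert(affd->nr_sets <= MAX_SETS);

	for (i = 0; i < affd->nr_sets; i++) {
		assert(affd->sets[i] >= 1);	/* every set needs a vector */
		ranges[i].first = next;
		next += affd->sets[i];
		ranges[i].end = next;
	}

	/* post_vectors occupy [next, next + post_vectors) */
	return next + affd->post_vectors;
}
```

With pre_vectors = 1, post_vectors = 0 and sets = { 4, 2 }, this places
set 0 at vectors [1..5) and set 1 at [5..7), for 7 vectors in total.
Each set would then be spread across all CPUs independently of the
others.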
