From: Nicolin Chen <nicolinc@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: <joro@8bytes.org>, <will@kernel.org>, <nicoleotsuka@gmail.com>,
<thierry.reding@gmail.com>, <vdumpa@nvidia.com>,
<nwatterson@nvidia.com>, <jean-philippe@linaro.org>,
<thunder.leizhen@huawei.com>, <chenxiang66@hisilicon.com>,
<Jonathan.Cameron@huawei.com>, <yuzenghui@huawei.com>,
<linux-kernel@vger.kernel.org>,
<iommu@lists.linux-foundation.org>,
<linux-arm-kernel@lists.infradead.org>,
<linux-tegra@vger.kernel.org>, <jgg@nvidia.com>
Subject: Re: [PATCH v3 4/5] iommu/arm-smmu-v3: Add host support for NVIDIA Grace CMDQ-V
Date: Tue, 21 Dec 2021 14:00:39 -0800
Message-ID: <20211221220037.GA6292@Asurada-Nvidia>
In-Reply-To: <7e68fa19-90b1-bbb5-9991-36b5d35278fa@arm.com>
On Tue, Dec 21, 2021 at 06:55:20PM +0000, Robin Murphy wrote:
>
> On 2021-12-20 19:27, Nicolin Chen wrote:
> > Hi Robin,
> >
> > Thank you for the reply!
> >
> > On Mon, Dec 20, 2021 at 06:42:26PM +0000, Robin Murphy wrote:
> > > On 2021-11-19 07:19, Nicolin Chen wrote:
> > > > From: Nate Watterson <nwatterson@nvidia.com>
> > > >
> > > > NVIDIA's Grace SoC has CMDQ-Virtualization (CMDQV) hardware,
> > > > which extends the standard ARM SMMU v3 IP to support multiple
> > > > VCMDQs with virtualization capabilities. In the host kernel,
> > > > they are used to reduce contention on a single command queue.
> > > > The VCMDQs are very similar to the standard CMDQ/ECMDQs, but
> > > > they only support CS_NONE in the CS field of CMD_SYNC commands.
> > > >
> > > > This patch adds a new nvidia-grace-cmdqv file and inserts its
> > > > structure pointer into the existing arm_smmu_device, and then
> > > > adds related function calls in the arm-smmu-v3 driver.
> > > >
> > > > In the CMDQV driver itself, this patch only adds the minimal
> > > > parts for host kernel support. Upon probe(), VINTF0 is reserved
> > > > for in-kernel use, and some of the VCMDQs are assigned to it.
> > > > The driver then selects one of the VCMDQs in VINTF0, based on
> > > > the CPU currently executing, to issue commands.
> > >
> > > Is there a tangible difference to DMA API or VFIO performance?
> >
> > Our testing environment is currently running on a single-core
> > CPU, so unfortunately we don't have perf data at this point.
>
> OK, as for the ECMDQ patches I think we'll need some investigation with
> real workloads to judge whether we can benefit from these things enough
> to justify the complexity, and whether the design is right.
>
> My gut feeling is that if these multi-queue schemes really can live up
> to their promise of making contention negligible, then they should
> further stand to benefit from bypassing the complex lock-free command
> batching in favour of something more lightweight, which could change the
> direction of much of the refactoring.
Makes sense. We will share our perf data once we have a certain
level of support in our test environment.
> > > [...]
> > > > +struct arm_smmu_cmdq *nvidia_grace_cmdqv_get_cmdq(struct arm_smmu_device *smmu)
> > > > +{
> > > > + struct nvidia_grace_cmdqv *cmdqv = smmu->nvidia_grace_cmdqv;
> > > > + struct nvidia_grace_cmdqv_vintf *vintf0 = &cmdqv->vintf0;
> > > > + u16 qidx;
> > > > +
> > > > + /* Check error status of vintf0 */
> > > > + if (!FIELD_GET(VINTF_STATUS, vintf0->status))
> > > > + return &smmu->cmdq;
> > > > +
> > > > + /*
> > > > + * Select a vcmdq to use. Here we use a temporary solution to
> > > > + * balance out traffic on cmdq issuing: each cmdq has its own
> > > > + * lock; if all CPUs issue cmdlists using the same cmdq, only
> > > > + * one CPU at a time can enter the process, while the others
> > > > + * spin on the same lock.
> > > > + */
> > > > + qidx = smp_processor_id() % cmdqv->num_vcmdqs_per_vintf;
> > >
> > > How does ordering work between queues? Do they follow a global order
> > > such that a sync on any queue is guaranteed to complete all prior
> > > commands on all queues?
> >
> > The CMDQV internal scheduler inserts a SYNC when (for example)
> > switching from VCMDQ0 to VCMDQ1 while the last command in VCMDQ0
> > is not a SYNC. The HW has a configuration bit in a register to
> > disable this feature, which is enabled by default.
>
> Interesting, thanks. So it sounds like this is something you can get
> away with for the moment, but may need to revisit once people chasing
> real-world performance start wanting to turn that bit off.
Yes, we have limitations in both the testing setup and the
available clients for an in-depth perf measurement at the
moment, but we will certainly revisit this as you mentioned.
Anyway, this is just initial support.
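For clarity, here is a rough userspace model of the scheduler behavior
described above. All names are made up for illustration, not taken from
the driver or the hardware spec; it only models "inject a SYNC when
switching away from a queue whose last command was not a SYNC":

```c
#include <assert.h>

/* Commands relevant to the model. */
enum cmd { CMD_TLBI, CMD_SYNC };

struct sched_model {
	int cur_q;          /* queue currently executing, -1 if none yet */
	enum cmd last_cmd;  /* last command seen on cur_q */
	int syncs_inserted; /* SYNCs the scheduler injected on switches */
};

/* Feed one command on queue q through the scheduler model. */
static void sched_exec(struct sched_model *s, int q, enum cmd c)
{
	/* Switching queues with an unsynced tail: inject a SYNC first. */
	if (s->cur_q >= 0 && q != s->cur_q && s->last_cmd != CMD_SYNC)
		s->syncs_inserted++;
	s->cur_q = q;
	s->last_cmd = c;
}
```

So a TLBI on VCMDQ0 followed by anything on VCMDQ1 costs one extra SYNC,
which is the overhead that the disable bit would trade away.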
> > > The challenge to make ECMDQ useful to Linux is how to make sure that all
> > > the commands expected to be within scope of a future CMND_SYNC plus that
> > > sync itself all get issued on the same queue, so I'd be mildly surprised
> > > if you didn't have the same problem.
> >
> > PATCH-3 in this series actually helps align the command queues
> > between issued commands and SYNC, if bool sync == true. Yet, if
> > doing something like issue->issue->issue_with_sync, it could be
> > trickier.
>
> Indeed between the iommu_iotlb_gather mechanism and low-level command
> batching things are already a lot more concentrated than they could be,
> but arm_smmu_cmdq_batch_add() and its callers stand out as examples of
> where we'd still be vulnerable to preemption. What I haven't even tried
> to reason about yet is assumptions in the higher-level APIs, e.g. if
> io-pgtable might chuck out a TLBI during an iommu_unmap() which we
> implicitly expect a later iommu_iotlb_sync() to cover.
Though I might have oversimplified the situation here, I see
that the arm_smmu_cmdq_batch_add() calls are typically followed
by arm_smmu_cmdq_batch_submit(). Could we just have the
_batch_submit() issue a SYNC on every queue that was previously
touched in _batch_add()?
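Something like the following sketch, as a standalone model. The names
and the bitmap field are invented for illustration (the real
arm_smmu_cmdq_batch has no such bitmap today), and the "CPU" is just a
parameter standing in for smp_processor_id():

```c
#include <assert.h>
#include <stdint.h>

#define NUM_QUEUES 4

struct batch {
	uint32_t touched;            /* bit i set => queue i got commands */
	int cmds_issued[NUM_QUEUES];
	int syncs_issued[NUM_QUEUES];
};

/* Model of _batch_add(): pick a queue by CPU id, as the quoted
 * qidx = smp_processor_id() % num_vcmdqs_per_vintf does, and
 * remember which queue was used in the bitmap. */
static void batch_add(struct batch *b, int cpu)
{
	int qidx = cpu % NUM_QUEUES;

	b->cmds_issued[qidx]++;
	b->touched |= 1u << qidx;
}

/* Model of _batch_submit(): issue a CMD_SYNC on every queue the
 * batch touched, not only the current CPU's queue. */
static void batch_submit(struct batch *b)
{
	for (int q = 0; q < NUM_QUEUES; q++)
		if (b->touched & (1u << q))
			b->syncs_issued[q]++;
}
```

The cost is one extra SYNC per extra queue a preempted batch wandered
onto, rather than relying on the hardware's implicit queue-switch SYNC.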
> I've been thinking that in many ways per-domain queues make quite a bit
> of sense and would be easier to manage than per-CPU ones - plus that's
> pretty much the usage model once we get to VMs anyway - but that fails
> to help the significant cases like networking and storage where many
> CPUs are servicing a big monolithic device in a single domain :(
Yes, and it's hard to predict which client would use the CMDQ
more frequently, in order to balance the queues or assign more
of them to that client, which feels like a QoS conundrum.
Thanks
Nic