From: Marc Zyngier <maz@kernel.org>
To: John Garry <john.garry@huawei.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
Robin Murphy <robin.murphy@arm.com>,
Ming Lei <ming.lei@redhat.com>,
iommu@lists.linux-foundation.org, Will Deacon <will@kernel.org>
Subject: Re: arm-smmu-v3 high cpu usage for NVMe
Date: Fri, 20 Mar 2020 16:33:27 +0000
Message-ID: <5198fcffc8ad6233e0274ebff9e9aa5f@kernel.org>
In-Reply-To: <b412fc9c-6266-e320-0769-f214d7752675@huawei.com>

Hi John,

On 2020-03-20 16:20, John Garry wrote:
>>>
>>>>
>>>> I've run a bunch of netperf instances on multiple cores and collected
>>>> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
>>>> consistently.
>>>>
>>>> - 6.07% arm_smmu_iotlb_sync
>>>>    - 5.74% arm_smmu_tlb_inv_range
>>>>         5.09% arm_smmu_cmdq_issue_cmdlist
>>>>         0.28% __pi_memset
>>>>         0.08% __pi_memcpy
>>>>         0.08% arm_smmu_atc_inv_domain.constprop.37
>>>>         0.07% arm_smmu_cmdq_build_cmd
>>>>         0.01% arm_smmu_cmdq_batch_add
>>>>      0.31% __pi_memset
>>>>
>>>> So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
>>>> when ATS is not used. According to the annotations, the load from the
>>>> atomic_read(), that checks whether the domain uses ATS, is 77% of the
>>>> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
>>>> there is much room for optimization there.
>>>
>>> Well I did originally suggest using RCU protection to scan the list of
>>> devices, instead of reading an atomic and checking for non-zero value. But
>>> that would be an optimisation for ATS also, and there were no ATS devices at
>>> the time (to verify performance).
>>
>> Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
>> performance, but I agree that using RCU to scan the list should get better
>> results when using ATS.
>>
>> When ATS isn't in use however, I suspect reading nr_ats_masters should be
>> more efficient than taking the RCU lock + reading an "ats_devices" list
>> (since the smmu_domain->devices list also serves context descriptor
>> invalidation, even when ATS isn't in use). I'll run some tests however, to
>> see if I can micro-optimize this case, but I don't expect noticeable
>> improvements.
>
> ok, cheers. I, too, would not expect a significant improvement there.
>
> JFYI, I've been playing with "perf annotate" today and it's giving
> strange results for my NVMe testing. So "report" looks somewhat sane,
> if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
>
>
>   55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
>    9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>    2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
>    1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
>    1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
>    1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
>    1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw
>
> But "annotate" consistently tells me that a specific instruction
> consumes ~99% of the load for the enqueue function:
>
>          :   /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
>          :   if (sync) {
>     0.00 :   ffff80001071c948:  ldr   w0, [x29, #108]
>          :   int ret = 0;
>     0.00 :   ffff80001071c94c:  mov   w24, #0x0   // #0
>          :   if (sync) {
>     0.00 :   ffff80001071c950:  cbnz  w0, ffff80001071c990 <arm_smmu_cmdq_issue_cmdlist+0x420>
>          :   arch_local_irq_restore():
>     0.00 :   ffff80001071c954:  msr   daif, x21
>          :   arm_smmu_cmdq_issue_cmdlist():
>          :   }
>          :   }
>          :
>          :   local_irq_restore(flags);
>          :   return ret;
>          :   }
>    99.51 :   ffff80001071c958:  adrp  x0, ffff800011909000 <page_wait_table+0x14c0>

This is likely the side effect of the re-enabling of interrupts (msr daif, x21)
on the previous instruction, which causes the perf interrupt to fire right
after.

Time to enable pseudo-NMIs in the PMUv3 driver...

M.
--
Jazz is not dead. It just smells funny...
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu