From: Marc Zyngier <maz@kernel.org>
To: John Garry <john.garry@huawei.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Ming Lei <ming.lei@redhat.com>,
	iommu@lists.linux-foundation.org, Will Deacon <will@kernel.org>
Subject: Re: arm-smmu-v3 high cpu usage for NVMe
Date: Fri, 20 Mar 2020 16:33:27 +0000
Message-ID: <5198fcffc8ad6233e0274ebff9e9aa5f@kernel.org>
In-Reply-To: <b412fc9c-6266-e320-0769-f214d7752675@huawei.com>

Hi John,

On 2020-03-20 16:20, John Garry wrote:
>>> 
>>>> 
>>>> I've run a bunch of netperf instances on multiple cores and collected
>>>> SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
>>>> consistently.
>>>> 
>>>> - 6.07% arm_smmu_iotlb_sync
>>>>      - 5.74% arm_smmu_tlb_inv_range
>>>>           5.09% arm_smmu_cmdq_issue_cmdlist
>>>>           0.28% __pi_memset
>>>>           0.08% __pi_memcpy
>>>>           0.08% arm_smmu_atc_inv_domain.constprop.37
>>>>           0.07% arm_smmu_cmdq_build_cmd
>>>>           0.01% arm_smmu_cmdq_batch_add
>>>>        0.31% __pi_memset
>>>> 
>>>> So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
>>>> when ATS is not used. According to the annotations, the load from the
>>>> atomic_read(), that checks whether the domain uses ATS, is 77% of the
>>>> samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
>>>> there is much room for optimization there.
>>> 
>>> Well I did originally suggest using RCU protection to scan the list of
>>> devices, instead of reading an atomic and checking for non-zero value. But
>>> that would be an optimisation for ATS also, and there were no ATS devices at
>>> the time (to verify performance).
>> 
>> Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
>> performance, but I agree that using RCU to scan the list should get better
>> results when using ATS.
>> 
>> When ATS isn't in use however, I suspect reading nr_ats_masters should be
>> more efficient than taking the RCU lock + reading an "ats_devices" list
>> (since the smmu_domain->devices list also serves context descriptor
>> invalidation, even when ATS isn't in use). I'll run some tests however, to
>> see if I can micro-optimize this case, but I don't expect noticeable
>> improvements.
> 
> ok, cheers. I, too, would not expect a significant improvement there.
> 
> JFYI, I've been playing with "perf annotate" today and it's giving
> strange results for my NVMe testing. So "report" looks somewhat sane,
> apart from a worryingly high % for arm_smmu_cmdq_issue_cmdlist():
> 
> 
>     55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
>      9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
>      1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
>      1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
>      1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
>      1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw
> 
> But "annotate" consistently tells me that a specific instruction
> consumes ~99% of the load for the enqueue function:
> 
>          :                      /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
>          :                      if (sync) {
>     0.00 :   ffff80001071c948:       ldr     w0, [x29, #108]
>          :                      int ret = 0;
>     0.00 :   ffff80001071c94c:       mov     w24, #0x0      // #0
>          :                      if (sync) {
>     0.00 :   ffff80001071c950:       cbnz    w0, ffff80001071c990 <arm_smmu_cmdq_issue_cmdlist+0x420>
>          :                      arch_local_irq_restore():
>     0.00 :   ffff80001071c954:       msr     daif, x21
>          :                      arm_smmu_cmdq_issue_cmdlist():
>          :                      }
>          :                      }
>          :
>          :                      local_irq_restore(flags);
>          :                      return ret;
>          :                      }
>    99.51 :   ffff80001071c958:       adrp    x0, ffff800011909000 <page_wait_table+0x14c0>

This is likely a side effect of the previous instruction re-enabling
interrupts (msr daif, x21): the perf interrupt has been pending while
interrupts were masked, so it fires right after the unmask and the sample
is attributed to the adrp that follows.

Time to enable pseudo-NMIs in the PMUv3 driver...

          M.
-- 
Jazz is not dead. It just smells funny...