* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
2016-11-02 13:35 Nipun Gupta
@ 2016-11-02 11:21 ` Robin Murphy
2016-11-02 19:26 ` Nipun Gupta
0 siblings, 1 reply; 8+ messages in thread
From: Robin Murphy @ 2016-11-02 11:21 UTC (permalink / raw)
To: linux-arm-kernel
Hi Nipun,
On 02/11/16 13:35, Nipun Gupta wrote:
> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) provides an
> option to enable the updation of TLB in case of bypass transactions due to
> no stream match in the stream match table. This reduces the latencies of
> the subsequent transactions with the same stream-id which bypasses the SMMU.
> This provides a significant performance benefit for certain networking
> workloads.
...at the cost of increased TLB contention against other workloads.
However, in the general case we'd expect the system to be fully
described, so if there aren't any unmatched Stream IDs there hopefully
shouldn't be an impact to leaving this switched on. I'd be interested to
see some actual performance numbers, though - you already know my
opinion about unsubstantiated quotes from the MMU-500 TRM.
> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
> ---
> drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index ce2a9d4..7010a5c 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
>
> #define ARM_MMU500_ACTLR_CPRE (1 << 1)
>
> +#define ACR_SMTNMB_TLBEN (1 << 8)
ACR is entirely implementation-defined, so there are no generic field
names. Please follow the naming convention handily demonstrated in the
subsequent context line.
> #define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
Actually, can we also please keep these in descending order of bit
position like everything else?
> #define CB_PAR_F (1 << 0)
> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
> for (i = 0; i < smmu->num_mapping_groups; ++i)
> arm_smmu_write_sme(smmu, i);
>
> + /* Get the major rev required for configuring ACR */
That comment is nonsense.
> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> +
> /*
> * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> * bit is only present in MMU-500r2 onwards.
> */
> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> - if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> + if ((smmu->model == ARM_MMU500) && (major >= 2))
> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> - writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> - }
> +
> + /*
> + * Set the SMTNMB_TLBEN in ACR so that the transactions which
> + * bypass with SMMU due to no stream match found in the SMR table
> + * are updated in the TLB's.
Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB
entries for reduced latency". It's already clear from the code what
bit's being set where, we only need to remember *why*.
> + */
> + reg |= ACR_SMTNMB_TLBEN;
> + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
Are you sure it's perfectly fine to set that implementation-defined bit
on any SMMU implementation other than the two-and-a-half ARM Ltd. ones
which happen to share the same meaning? I'm certainly not.
The correct flow would be something like this:
if (smmu->model == ARM_MMU500) {
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
if (major >= 2)
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
reg |= ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}
The shape of the current code avoids an extra level of indentation, but
once you have to have the nested conditional anyway, it might as well
all be predicated appropriately.
Robin.
> /* Make sure all context banks are disabled and clear CB_FSR */
> for (i = 0; i < smmu->num_context_banks; ++i) {
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
@ 2016-11-02 13:35 Nipun Gupta
2016-11-02 11:21 ` Robin Murphy
0 siblings, 1 reply; 8+ messages in thread
From: Nipun Gupta @ 2016-11-02 13:35 UTC (permalink / raw)
To: linux-arm-kernel
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) provides an
option to enable the updation of TLB in case of bypass transactions due to
no stream match in the stream match table. This reduces the latencies of
the subsequent transactions with the same stream-id which bypasses the SMMU.
This provides a significant performance benefit for certain networking
workloads.
Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
---
drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ce2a9d4..7010a5c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)
+#define ACR_SMTNMB_TLBEN (1 << 8)
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
#define CB_PAR_F (1 << 0)
@@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);
+ /* Get the major rev required for configuring ACR */
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
+
/*
* Before clearing ARM_MMU500_ACTLR_CPRE, need to
* clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
* bit is only present in MMU-500r2 onwards.
*/
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ if ((smmu->model == ARM_MMU500) && (major >= 2))
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
- writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
- }
+
+ /*
+ * Set the SMTNMB_TLBEN in ACR so that the transactions which
+ * bypass with SMMU due to no stream match found in the SMR table
+ * are updated in the TLB's.
+ */
+ reg |= ACR_SMTNMB_TLBEN;
+ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
/* Make sure all context banks are disabled and clear CB_FSR */
for (i = 0; i < smmu->num_context_banks; ++i) {
--
1.9.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
2016-11-02 11:21 ` Robin Murphy
@ 2016-11-02 19:26 ` Nipun Gupta
2016-11-03 11:08 ` Robin Murphy
0 siblings, 1 reply; 8+ messages in thread
From: Nipun Gupta @ 2016-11-02 19:26 UTC (permalink / raw)
To: linux-arm-kernel
Hi Robin,
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy at arm.com]
> Sent: Wednesday, November 02, 2016 16:51
> To: Nipun Gupta <nipun.gupta@nxp.com>; will.deacon at arm.com; linux-arm-
> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> Cc: Stuart Yoder <stuart.yoder@nxp.com>
> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
>
> Hi Nipun,
>
> On 02/11/16 13:35, Nipun Gupta wrote:
> > The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
> > provides an option to enable the updation of TLB in case of bypass
> > transactions due to no stream match in the stream match table. This
> > reduces the latencies of the subsequent transactions with the same stream-id
> which bypasses the SMMU.
> > This provides a significant performance benefit for certain networking
> > workloads.
>
> ...at the cost of increased TLB contention against other workloads.
> However, in the general case we'd expect the system to be fully described, so if
> there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
> to leaving this switched on. I'd be interested to see some actual performance
> numbers, though - you already know my opinion about unsubstantiated quotes
> from the MMU-500 TRM.
With this change we have seen substantial performance improvement of ~9-10%
with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
(loopback mode - NXP in-house) we have seen 5% improvement in performance on
LS1088 platform.
W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
platform cycles per memory read transactions which follow this bypass path (on LS2088
with DPDK l3fwd application).
(Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).
>
> > Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
> > ---
> > drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> > 1 file changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> > ce2a9d4..7010a5c 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
> >
> > #define ARM_MMU500_ACTLR_CPRE (1 << 1)
> >
> > +#define ACR_SMTNMB_TLBEN (1 << 8)
>
> ACR is entirely implementation-defined, so there are no generic field names.
> Please follow the naming convention handily demonstrated in the subsequent
> context line.
>
> > #define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
>
> Actually, can we also please keep these in descending order of bit position like
> everything else?
>
> > #define CB_PAR_F (1 << 0)
> > @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
> arm_smmu_device *smmu)
> > for (i = 0; i < smmu->num_mapping_groups; ++i)
> > arm_smmu_write_sme(smmu, i);
> >
> > + /* Get the major rev required for configuring ACR */
>
> That comment is nonsense.
>
> > + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> > + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> > +
> > /*
> > * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> > * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> > * bit is only present in MMU-500r2 onwards.
> > */
> > - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> > - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> > - if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> > - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> > + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> > + if ((smmu->model == ARM_MMU500) && (major >= 2))
> > reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> > - writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> > - }
> > +
> > + /*
> > + * Set the SMTNMB_TLBEN in ACR so that the transactions which
> > + * bypass with SMMU due to no stream match found in the SMR table
> > + * are updated in the TLB's.
>
> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
> reduced latency". It's already clear from the code what bit's being set where, we
> only need to remember *why*.
>
> > + */
> > + reg |= ACR_SMTNMB_TLBEN;
> > + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>
> Are you sure it's perfectly fine to set that implementation-defined bit on any
> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
> happen to share the same meaning? I'm certainly not.
>
> The correct flow would be something like this:
>
> if (smmu->model == ARM_MMU500) {
> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> if (major >= 2)
> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> reg |= ACR_SMTNMB_TLBEN;
> writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> }
>
> The shape of the current code avoids an extra level of indentation, but once you
> have to have the nested conditional anyway, it might as well all be predicated
> appropriately.
>
Thank you for providing the useful comments. I would incorporate them all in next version :).
Regards,
Nipun
> Robin.
>
> > /* Make sure all context banks are disabled and clear CB_FSR */
> > for (i = 0; i < smmu->num_context_banks; ++i) {
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
2016-11-02 19:26 ` Nipun Gupta
@ 2016-11-03 11:08 ` Robin Murphy
2016-11-03 18:24 ` Nipun Gupta
0 siblings, 1 reply; 8+ messages in thread
From: Robin Murphy @ 2016-11-03 11:08 UTC (permalink / raw)
To: linux-arm-kernel
On 02/11/16 19:26, Nipun Gupta wrote:
>
> Hi Robin,
>
>> -----Original Message-----
>> From: Robin Murphy [mailto:robin.murphy at arm.com]
>> Sent: Wednesday, November 02, 2016 16:51
>> To: Nipun Gupta <nipun.gupta@nxp.com>; will.deacon at arm.com; linux-arm-
>> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
>> Cc: Stuart Yoder <stuart.yoder@nxp.com>
>> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
>> caching of bypass entries
>>
>> Hi Nipun,
>>
>> On 02/11/16 13:35, Nipun Gupta wrote:
>>> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
>>> provides an option to enable the updation of TLB in case of bypass
>>> transactions due to no stream match in the stream match table. This
>>> reduces the latencies of the subsequent transactions with the same stream-id
>> which bypasses the SMMU.
>>> This provides a significant performance benefit for certain networking
>>> workloads.
>>
>> ...at the cost of increased TLB contention against other workloads.
>> However, in the general case we'd expect the system to be fully described, so if
>> there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
>> to leaving this switched on. I'd be interested to see some actual performance
>> numbers, though - you already know my opinion about unsubstantiated quotes
>> from the MMU-500 TRM.
>
> With this change we have seen substantial performance improvement of ~9-10%
> with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
> on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
> (loopback mode - NXP in-house) we have seen 5% improvement in performance on
> LS1088 platform.
>
> W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
> platform cycles per memory read transactions which follow this bypass path (on LS2088
> with DPDK l3fwd application).
>
> (Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).
That's understandable, and I'm not sure I'd know how to interpret them
anyway ;) I reckon the percentages make a sufficiently compelling
qualification of the improvement, so it would be good to have that
summarised in the commit log.
>>
>>> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
>>> ---
>>> drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
>>> 1 file changed, 15 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
>>> ce2a9d4..7010a5c 100644
>>> --- a/drivers/iommu/arm-smmu.c
>>> +++ b/drivers/iommu/arm-smmu.c
>>> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
>>>
>>> #define ARM_MMU500_ACTLR_CPRE (1 << 1)
>>>
>>> +#define ACR_SMTNMB_TLBEN (1 << 8)
>>
>> ACR is entirely implementation-defined, so there are no generic field names.
>> Please follow the naming convention handily demonstrated in the subsequent
>> context line.
>>
>>> #define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
>>
>> Actually, can we also please keep these in descending order of bit position like
>> everything else?
>>
>>> #define CB_PAR_F (1 << 0)
>>> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
>> arm_smmu_device *smmu)
>>> for (i = 0; i < smmu->num_mapping_groups; ++i)
>>> arm_smmu_write_sme(smmu, i);
>>>
>>> + /* Get the major rev required for configuring ACR */
>>
>> That comment is nonsense.
>>
>>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
>>> + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
>>> +
>>> /*
>>> * Before clearing ARM_MMU500_ACTLR_CPRE, need to
>>> * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
>>> * bit is only present in MMU-500r2 onwards.
>>> */
>>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
>>> - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
>>> - if ((smmu->model == ARM_MMU500) && (major >= 2)) {
>>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
>>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
>>> + if ((smmu->model == ARM_MMU500) && (major >= 2))
>>> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
>>> - writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>>> - }
>>> +
>>> + /*
>>> + * Set the SMTNMB_TLBEN in ACR so that the transactions which
>>> + * bypass with SMMU due to no stream match found in the SMR table
>>> + * are updated in the TLB's.
>>
>> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
>> reduced latency". It's already clear from the code what bit's being set where, we
>> only need to remember *why*.
>>
>>> + */
>>> + reg |= ACR_SMTNMB_TLBEN;
>>> + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>>
>> Are you sure it's perfectly fine to set that implementation-defined bit on any
>> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
>> happen to share the same meaning? I'm certainly not.
>>
>> The correct flow would be something like this:
>>
>> if (smmu->model == ARM_MMU500) {
>> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
>> major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
>> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
>> if (major >= 2)
>> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
>> reg |= ACR_SMTNMB_TLBEN;
>> writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>> }
>>
>> The shape of the current code avoids an extra level of indentation, but once you
>> have to have the nested conditional anyway, it might as well all be predicated
>> appropriately.
>>
>
> Thank you for providing the useful comments. I would incorporate them all in next version :).
Cool. Just for clarity (I realise I was thinking it, but never said it
outright), whilst MMU-40x do share the same feature with the same ACR
bit definition as MMU-500, I'd be inclined not to bother with them.
Since the monolithic microarchitecture means there's normally a separate
MMU-40x per device, if you don't want translation for that device you
can simply not probe the thing to turn it on in the first place.
Robin.
>
> Regards,
> Nipun
>
>> Robin.
>>
>>> /* Make sure all context banks are disabled and clear CB_FSR */
>>> for (i = 0; i < smmu->num_context_banks; ++i) {
>>>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
2016-11-03 11:08 ` Robin Murphy
@ 2016-11-03 18:24 ` Nipun Gupta
0 siblings, 0 replies; 8+ messages in thread
From: Nipun Gupta @ 2016-11-03 18:24 UTC (permalink / raw)
To: linux-arm-kernel
Hi Robin,
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy at arm.com]
> Sent: Thursday, November 03, 2016 16:39
> To: Nipun Gupta <nipun.gupta@nxp.com>; will.deacon at arm.com; linux-arm-
> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> Cc: Stuart Yoder <stuart.yoder@nxp.com>
> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
>
> On 02/11/16 19:26, Nipun Gupta wrote:
> >
> > Hi Robin,
> >
> >> -----Original Message-----
> >> From: Robin Murphy [mailto:robin.murphy at arm.com]
> >> Sent: Wednesday, November 02, 2016 16:51
> >> To: Nipun Gupta <nipun.gupta@nxp.com>; will.deacon at arm.com; linux-arm-
> >> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> >> Cc: Stuart Yoder <stuart.yoder@nxp.com>
> >> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to
> enable
> >> caching of bypass entries
> >>
> >> Hi Nipun,
> >>
> >> On 02/11/16 13:35, Nipun Gupta wrote:
> >>> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
> >>> provides an option to enable the updation of TLB in case of bypass
> >>> transactions due to no stream match in the stream match table. This
> >>> reduces the latencies of the subsequent transactions with the same stream-
> id
> >> which bypasses the SMMU.
> >>> This provides a significant performance benefit for certain networking
> >>> workloads.
> >>
> >> ...at the cost of increased TLB contention against other workloads.
> >> However, in the general case we'd expect the system to be fully described, so
> if
> >> there aren't any unmatched Stream IDs there hopefully shouldn't be an
> impact
> >> to leaving this switched on. I'd be interested to see some actual performance
> >> numbers, though - you already know my opinion about unsubstantiated
> quotes
> >> from the MMU-500 TRM.
> >
> > With this change we have seen substantial performance improvement of ~9-
> 10%
> > with DPDK l3fwd application
> (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
> > on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP
> reflector application
> > (loopback mode - NXP in-house) we have seen 5% improvement in
> performance on
> > LS1088 platform.
> >
> > W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from
> avg. ~140
> > platform cycles per memory read transactions which follow this bypass path
> (on LS2088
> > with DPDK l3fwd application).
> >
> > (Apologies, I cannot share the DPDK/ODP exact performance numbers on the
> mailing list).
>
> That's understandable, and I'm not sure I'd know how to interpret them
> anyway ;) I reckon the percentages make a sufficiently compelling
> qualification of the improvement, so it would be good to have that
> summarised in the commit log.
Sure, I'll add a part of it in the commit log.
>
> >>
> >>> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
> >>> ---
> >>> drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> >>> 1 file changed, 15 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> >>> ce2a9d4..7010a5c 100644
> >>> --- a/drivers/iommu/arm-smmu.c
> >>> +++ b/drivers/iommu/arm-smmu.c
> >>> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
> >>>
> >>> #define ARM_MMU500_ACTLR_CPRE (1 << 1)
> >>>
> >>> +#define ACR_SMTNMB_TLBEN (1 << 8)
> >>
> >> ACR is entirely implementation-defined, so there are no generic field names.
> >> Please follow the naming convention handily demonstrated in the subsequent
> >> context line.
> >>
> >>> #define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
> >>
> >> Actually, can we also please keep these in descending order of bit position
> like
> >> everything else?
> >>
> >>> #define CB_PAR_F (1 << 0)
> >>> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
> >> arm_smmu_device *smmu)
> >>> for (i = 0; i < smmu->num_mapping_groups; ++i)
> >>> arm_smmu_write_sme(smmu, i);
> >>>
> >>> + /* Get the major rev required for configuring ACR */
> >>
> >> That comment is nonsense.
> >>
> >>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>> + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>> +
> >>> /*
> >>> * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> >>> * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> >>> * bit is only present in MMU-500r2 onwards.
> >>> */
> >>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>> - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>> - if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> >>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>> + if ((smmu->model == ARM_MMU500) && (major >= 2))
> >>> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> >>> - writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>> - }
> >>> +
> >>> + /*
> >>> + * Set the SMTNMB_TLBEN in ACR so that the transactions which
> >>> + * bypass with SMMU due to no stream match found in the SMR table
> >>> + * are updated in the TLB's.
> >>
> >> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries
> for
> >> reduced latency". It's already clear from the code what bit's being set where,
> we
> >> only need to remember *why*.
> >>
> >>> + */
> >>> + reg |= ACR_SMTNMB_TLBEN;
> >>> + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>
> >> Are you sure it's perfectly fine to set that implementation-defined bit on any
> >> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
> >> happen to share the same meaning? I'm certainly not.
> >>
> >> The correct flow would be something like this:
> >>
> >> if (smmu->model == ARM_MMU500) {
> >> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >> major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >> if (major >= 2)
> >> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> >> reg |= ACR_SMTNMB_TLBEN;
> >> writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >> }
> >>
> >> The shape of the current code avoids an extra level of indentation, but once
> you
> >> have to have the nested conditional anyway, it might as well all be predicated
> >> appropriately.
> >>
> >
> > Thank you for providing the useful comments. I would incorporate them all in
> next version :).
>
> Cool. Just for clarity (I realise I was thinking it, but never said it
> outright), whilst MMU-40x do share the same feature with the same ACR
> bit definition as MMU-500, I'd be inclined not to bother with them.
> Since the monolithic microarchitecture means there's normally a separate
> MMU-40x per device, if you don't want translation for that device you
> can simply not probe the thing to turn it on in the first place.
>
This seems a pretty decent reason to have this bit set only for MMU-500.
I'll send a patch v2 soon.
Thanks,
Nipun
> Robin.
>
> >
> > Regards,
> > Nipun
> >
> >> Robin.
> >>
> >>> /* Make sure all context banks are disabled and clear CB_FSR */
> >>> for (i = 0; i < smmu->num_context_banks; ++i) {
> >>>
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
2016-11-04 0:27 [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries Nipun Gupta
@ 2016-11-03 20:57 ` Stuart Yoder
2016-11-04 3:52 ` Nipun Gupta
0 siblings, 1 reply; 8+ messages in thread
From: Stuart Yoder @ 2016-11-03 20:57 UTC (permalink / raw)
To: linux-arm-kernel
> -----Original Message-----
> From: Nipun Gupta [mailto:nipun.gupta at nxp.com]
> Sent: Thursday, November 03, 2016 7:27 PM
> To: robin.murphy at arm.com; will.deacon at arm.com; linux-arm-kernel at lists.infradead.org; iommu at lists.linux-
> foundation.org
> Cc: Stuart Yoder <stuart.yoder@nxp.com>; Nipun Gupta <nipun.gupta@nxp.com>
> Subject: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
When you respin a patch, put the version number in the patch subject:
[PATCH v2] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
Stuart
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
@ 2016-11-04 0:27 Nipun Gupta
2016-11-03 20:57 ` Stuart Yoder
0 siblings, 1 reply; 8+ messages in thread
From: Nipun Gupta @ 2016-11-04 0:27 UTC (permalink / raw)
To: linux-arm-kernel
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) provides an
option to enable the updation of TLB in case of bypass transactions due to
no stream match in the stream match table. This reduces the latencies of
the subsequent transactions with the same stream-id which bypasses the SMMU.
This provides a significant performance benefit for certain networking
workloads.
With this change substantial performance improvement of ~9% is observed with
DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform.
Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
---
Changes for v2:
- Incorporated Robin's comments on v1 related to
Setting SMTNMB_TLBEN in ACR only for MMU-500 as ACR is implementation dependent
Code comments and Naming convention
drivers/iommu/arm-smmu.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ce2a9d4..05901be 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -247,6 +247,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
+#define ARM_MMU500_ACR_SMTNMB_TLBEN (1 << 8)
#define CB_PAR_F (1 << 0)
@@ -1569,16 +1570,22 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);
- /*
- * Before clearing ARM_MMU500_ACTLR_CPRE, need to
- * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
- * bit is only present in MMU-500r2 onwards.
- */
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
+ if (smmu->model == ARM_MMU500) {
+ /*
+ * Before clearing ARM_MMU500_ACTLR_CPRE, need to
+ * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
+ * bit is only present in MMU-500r2 onwards.
+ */
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
- reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
+ if (major >= 2)
+ reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
+ /*
+ * Allow unmatched Stream IDs to allocate bypass
+ * TLB entries for reduced latency.
+ */
+ reg |= ARM_MMU500_ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}
--
1.9.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
2016-11-03 20:57 ` Stuart Yoder
@ 2016-11-04 3:52 ` Nipun Gupta
0 siblings, 0 replies; 8+ messages in thread
From: Nipun Gupta @ 2016-11-04 3:52 UTC (permalink / raw)
To: linux-arm-kernel
> -----Original Message-----
> From: Stuart Yoder
> Sent: Friday, November 04, 2016 2:27
> To: Nipun Gupta <nipun.gupta@nxp.com>; robin.murphy at arm.com;
> will.deacon at arm.com; linux-arm-kernel at lists.infradead.org;
> iommu at lists.linux-foundation.org
> Cc: Nipun Gupta <nipun.gupta@nxp.com>
> Subject: RE: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
>
>
>
> > -----Original Message-----
> > From: Nipun Gupta [mailto:nipun.gupta at nxp.com]
> > Sent: Thursday, November 03, 2016 7:27 PM
> > To: robin.murphy at arm.com; will.deacon at arm.com;
> > linux-arm-kernel at lists.infradead.org; iommu at lists.linux-
> > foundation.org
> > Cc: Stuart Yoder <stuart.yoder@nxp.com>; Nipun Gupta
> > <nipun.gupta@nxp.com>
> > Subject: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> > caching of bypass entries
>
> When you respin a patch, put the version number in the patch subject:
>
> [PATCH v2] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching
> of bypass entries
I added that first, but in the last moment used git format-patch after minor updates
and somehow missed it in the final patch :/. Re-spinning v3 with correct version number.
Thanks,
Nipun
>
> Stuart
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2016-11-04 3:52 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-04 0:27 [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries Nipun Gupta
2016-11-03 20:57 ` Stuart Yoder
2016-11-04 3:52 ` Nipun Gupta
-- strict thread matches above, loose matches on Subject: below --
2016-11-02 13:35 Nipun Gupta
2016-11-02 11:21 ` Robin Murphy
2016-11-02 19:26 ` Nipun Gupta
2016-11-03 11:08 ` Robin Murphy
2016-11-03 18:24 ` Nipun Gupta
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox