From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Nate Watterson <nwatters@codeaurora.org>
Cc: John Garry <john.garry@huawei.com>,
	"Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>,
	Will Deacon <will.deacon@arm.com>,
	"Joerg Roedel" <joro@8bytes.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	iommu <iommu@lists.linux-foundation.org>,
	Robin Murphy <robin.murphy@arm.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Zefan Li <lizefan@huawei.com>, Xinwei Hu <huxinwei@huawei.com>,
	Tianhong Ding <dingtianhong@huawei.com>,
	Hanjun Guo <guohanjun@huawei.com>, <zhouyoujun@huawei.com>
Subject: Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
Date: Fri, 21 Jul 2017 18:57:15 +0800
Message-ID: <20170721185715.0000533a@huawei.com>
In-Reply-To: <c1d85f28-c57b-4414-3504-16afb3a19ce0@codeaurora.org>

On Thu, 20 Jul 2017 15:07:05 -0400
Nate Watterson <nwatters@codeaurora.org> wrote:

> Hi Jonathan,
> 
> [...]
> >>>>>        
> >>> Hi All,
> >>>
> >>> I'm a bit of a late entry to this discussion.  Just been running some more
> >>> detailed tests on our d05 boards and wanted to bring some more numbers to
> >>> the discussion.
> >>>
> >>> All tests against 4.12 with the following additions:
> >>> * Robin's series removing the io-pgtable spinlock (and a few recent fixes)
> >>> * Cherry picked updates to the sas driver, merged prior to 4.13-rc1
> >>> * An additional HNS (network card) bug fix that will be upstreamed shortly.
> >>>
> >>> I've broken the results down into this patch and this patch + the remainder
> >>> of the set. As leizhen mentioned we got a nice little performance
> >>> bump from Robin's series so that was applied first (as it's in mainline now)
> >>>
> >>> SAS tests were fio with noop scheduler, 4k block size and various io depths,
> >>> 1 process per disk.  Note this is probably a different setup to leizhen's
> >>> original numbers.
> >>>
> >>> Percentages are relative to the performance seen with the smmu disabled.
> >>> SAS
> >>> 4.12 - none of this series.
> >>> SMMU disabled
> >>> read io-depth 32 -   384K IOPS (100%)
> >>> read io-depth 2048 - 950K IOPS (100%)
> >>> rw io-depth 32 -     166K IOPS (100%)
> >>> rw io-depth 2048 -   340K IOPS (100%)
> >>>
> >>> SMMU enabled
> >>> read io-depth 32 -   201K IOPS (52%)
> >>> read io-depth 2048 - 306K IOPS (32%)
> >>> rw io-depth 32 -     99K  IOPS (60%)
> >>> rw io-depth 2048 -   150K IOPS (44%)
> >>>
> >>> Robin's recent series with fixes as seen on list (now merged)
> >>> SMMU enabled.
> >>> read io-depth 32 -   208K IOPS (54%)
> >>> read io-depth 2048 - 335K IOPS (35%)
> >>> rw io-depth 32 -     105K IOPS (63%)
> >>> rw io-depth 2048 -   165K IOPS (49%)
> >>>
> >>> 4.12 + Robin's series + just this patch SMMU enabled
> >>>
> >>> (iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock conflict)
> >>>
> >>> read io-depth 32 -   225K IOPS (59%)
> >>> read io-depth 2048 - 365K IOPS (38%)
> >>> rw io-depth 32 -     110K IOPS (66%)
> >>> rw io-depth 2048 -   179K IOPS (53%)
> >>>
> >>> 4.12 + Robin's series + Second part of this series
> >>>
> >>> (iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock conflict)
> >>> (iommu: add a new member unmap_tlb_sync into struct iommu_ops)
> >>> (iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync)
> >>> (iommu/arm-smmu: add support for unmap of a memory range with only one tlb sync)
> >>>
> >>> read io-depth 32 -    225K IOPS (59%)
> >>> read io-depth 2048 -  833K IOPS (88%)
> >>> rw io-depth 32 -      112K IOPS (67%)
> >>> rw io-depth 2048 -    220K IOPS (65%)
> >>>
> >>> Robin's series gave us small gains across the board (3-5% recovered)
> >>> relative to the no smmu performance (which we are taking as the ideal case)
> >>>
> >>> This first patch gets us back another 2-5% of the no smmu performance
> >>>
> >>> The next few patches get us very little advantage on the small io-depths
> >>> but make a large difference to the larger io-depths - in particular the
> >>> read IOPS which is over twice as fast as without the series.
> >>>
> >>> For HNS it seems that we are less dependent on the SMMU performance and
> >>> can reach the non SMMU speed.
> >>>
> >>> Tests with
> >>> iperf -t 30 -i 10 -c IPADDRESS -P 3, with the last 10 seconds taken to avoid any
> >>> initial variability.
> >>>
> >>> The server end of the link was always running with smmu v3 disabled
> >>> so as to act as a fast sink of the data. Some variation seen across
> >>> repeat runs.
> >>>
> >>> Mainline v4.12 + network card fix
> >>> NO SMMU
> >>> 9.42 GBits/sec
> >>>
> >>> SMMU
> >>> 4.36 GBits/sec (46%)
> >>>
> >>> Robin's io-pgtable spinlock series
> >>>
> >>> 6.68 to 7.34 GBits/sec (71% - 78%, variation across runs)
> >>>
> >>> Just this patch SMMU enabled
> >>>
> >>> (iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock conflict)
> >>>
> >>> 7.96-8.8 GBits/sec (85% - 94%  some variation across runs)
> >>>
> >>> Full series
> >>>
> >>> (iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock conflict)
> >>> (iommu: add a new member unmap_tlb_sync into struct iommu_ops)
> >>> (iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync)
> >>> (iommu/arm-smmu: add support for unmap of a memory range with only one tlb sync)
> >>>
> >>> 9.42 GBits/sec (100%)
> >>>
> >>> So HNS test shows a greater boost from Robin's series and this first patch.
> >>> This is most likely because the HNS test is not putting as high a load on
> >>> the SMMU and associated code as the SAS test.
> >>>
> >>> In both cases however, this shows that both parts of this patch
> >>> series are beneficial.
> >>>
> >>> So on to the questions ;)
> >>>
> >>> Will, you mentioned that along with Robin and Nate you were working on
> >>> a somewhat related strategy to improve the performance.  Any ETA on that?  
> >>
> >> The strategy I was working on is basically equivalent to the second
> >> part of the series. I will test your patches out sometime this week, and
> >> I'll also try to have our performance team run it through their whole
> >> suite.  
> > 
> > Thanks, that's excellent.  Look forward to hearing how it goes.  
> 
> I tested the patches with 4 NVME drives connected to a single SMMU and
> the results seem to be in line with those you've reported.
> 
> FIO - 512k blocksize / io-depth 32 / 1 thread per drive
>   Baseline 4.13-rc1 w/SMMU enabled: 25% of SMMU bypass performance
>   Baseline + Patch 1              : 28%
>   Baseline + Patches 2-5          : 86%
>   Baseline + Complete series      : 100% [!!]
> 
> I saw performance improvements across all of the other FIO profiles I
> tested, although not always as substantial as was seen in the 512k/32/1
> case. The performance of some of the profiles, especially those with
> many threads per drive, remains woeful (often below 20%), but hopefully
> Robin's iova series will help improve that.
Excellent.  Thanks for the info and running the tests.

Even with both series we are still seeing some reduction in performance
relative to the no-smmu case, but to a much lesser extent.

Jonathan
> 
> > 
> > Particularly useful would be to know if there are particular performance tests
> > that show up anything interesting that we might want to replicate.
> > 
> > Jonathan and Leizhen  
> >>  
> >>>
> >>> As you might imagine, with the above numbers we are very keen to try and
> >>> move forward with this as quickly as possible.
> >>>
> >>> If you want additional testing we would be happy to help.
> >>>
> >>> Thanks,
> >>>
> >>> Jonathan  
> [...]
> 
> -Nate
> 
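
The percentages quoted for the SAS fio runs can be recomputed from the raw
IOPS figures. The following is only a minimal sketch: the IOPS values (in
thousands) are copied from the tables above, while the dictionary layout and
the names pct_of_baseline and summarize are illustrative assumptions, not
anything used by the original testers.

```python
# Recompute "percentage of SMMU-disabled performance" for the SAS fio runs.
# IOPS values (thousands) are copied from the quoted tables; the data layout
# and helper names are hypothetical, chosen only for this illustration.

BASELINE = {"read-32": 384, "read-2048": 950, "rw-32": 166, "rw-2048": 340}

RUNS = {
    "4.12, SMMU enabled": {
        "read-32": 201, "read-2048": 306, "rw-32": 99, "rw-2048": 150,
    },
    "+ Robin's series": {
        "read-32": 208, "read-2048": 335, "rw-32": 105, "rw-2048": 165,
    },
    "+ patch 1": {
        "read-32": 225, "read-2048": 365, "rw-32": 110, "rw-2048": 179,
    },
    "+ second part of series": {
        "read-32": 225, "read-2048": 833, "rw-32": 112, "rw-2048": 220,
    },
}


def pct_of_baseline(iops, baseline_iops):
    """Return iops as a whole-number percentage of the SMMU-disabled figure."""
    return round(100.0 * iops / baseline_iops)


def summarize(runs, baseline):
    """Map each configuration to its per-workload percentage of baseline."""
    return {
        name: {wl: pct_of_baseline(v, baseline[wl]) for wl, v in results.items()}
        for name, results in runs.items()
    }


if __name__ == "__main__":
    for name, pcts in summarize(RUNS, BASELINE).items():
        print(f"{name:26s} {pcts}")
```

With these inputs the read io-depth 2048 case comes out at 88% for the second
part of the series, and 833K against 365K IOPS is a factor of roughly 2.3,
consistent with the "over twice as fast" remark in the quoted text.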
