From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Leizhen (ThunderTown)"
Subject: Re: [PATCH v2 1/3] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
Date: Thu, 19 Oct 2017 11:00:45 +0800
Message-ID: <59E8155D.2070102@huawei.com>
References: <1505221238-9428-1-git-send-email-thunder.leizhen@huawei.com> <1505221238-9428-2-git-send-email-thunder.leizhen@huawei.com> <20171018125849.GD4077@arm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20171018125849.GD4077@arm.com>
To: Will Deacon
Cc: Kefeng Wang, linux-kernel, Jinyue Li, iommu, Libin, Hanjun Guo, linux-arm-kernel
List-Id: iommu@lists.linux-foundation.org

On 2017/10/18 20:58, Will Deacon wrote:
> Hi Thunder,
>
> On Tue, Sep 12, 2017 at 09:00:36PM +0800, Zhen Lei wrote:
>> Because all TLBI commands should be followed by a SYNC command, to make
>> sure that it has been completely finished. So we can just add the TLBI
>> commands into the queue, and put off the execution until meet SYNC or
>> other commands. To prevent the followed SYNC command waiting for a long
>> time because of too many commands have been delayed, restrict the max
>> delayed number.
>>
>> According to my test, I got the same performance data as I replaced writel
>> with writel_relaxed in queue_inc_prod.
>>
>> Signed-off-by: Zhen Lei
>> ---
>> drivers/iommu/arm-smmu-v3.c | 42 +++++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 37 insertions(+), 5 deletions(-)
>
> If we want to go down the route of explicit command batching, I'd much
> rather do it by implementing the iotlb_range_add callback in the driver,
> and have a fixed-length array of batched ranges on the domain. We could

I think this patch is still valuable even if the iotlb_range_add callback is implemented. Its main purpose is to reduce the number of dsb operations. Consider the scenario with iotlb_range_add implemented:

.iotlb_range_add:
	spin_lock_irqsave(&smmu->cmdq.lock, flags);
	...
	add tlbi range-1 to cmq-queue
	...
	add tlbi range-n to cmq-queue	//n
	dsb
	...
	spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

.iotlb_sync:
	spin_lock_irqsave(&smmu->cmdq.lock, flags);
	...
	add cmd_sync to cmq-queue
	dsb
	...
	spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

Although iotlb_range_add saves n-1 dsb operations, one extra dsb is still left. When n is small, that remaining dsb matters, so this patch is still helpful.

> potentially toggle this function pointer based on the compatible string too,
> if it shows only to benefit some systems.

[ On 2017/9/19 12:31, Nate Watterson wrote:
I tested these (2) patches on QDF2400 hardware and saw performance
improvements in line with those I reported when testing the original
series.
]
I'm not sure whether this patch on its own improves performance on QDF2400, because the test covered two patches together. But at least it seems harmless, and that may hold for other hardware platforms too.

>
> Will
>
> .
>

-- 
Thanks!
BestRegards