From mboxrd@z Thu Jan 1 00:00:00 1970
From: linu.cherian@cavium.com (Linu Cherian)
Date: Fri, 23 Jun 2017 17:04:05 +0530
Subject: [PATCH v2 5/8] iommu/io-pgtable-arm: Support lockless operation
In-Reply-To: <7a8da378-19e2-2f35-877a-cc6ce389301a@arm.com>
References: <20170623055326.GA2949@virtx40>
 <20170623085607.GC3718@virtx40>
 <7a8da378-19e2-2f35-877a-cc6ce389301a@arm.com>
Message-ID: <20170623113405.GA5221@virtx40>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> On 23/06/17 09:56, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> >>
> >> Robin,
> >> Was trying to understand the new changes. Had a few questions on
> >> arm_lpae_install_table.
> >>
> >> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> >>> For parallel I/O with multiple concurrent threads servicing the same
> >>> device (or devices, if several share a domain), serialising page table
> >>> updates becomes a massive bottleneck. On reflection, though, we don't
> >>> strictly need to do that - for valid IOMMU API usage, there are in fact
> >>> only two races that we need to guard against: multiple map requests for
> >>> different blocks within the same region, when the intermediate-level
> >>> table for that region does not yet exist; and multiple unmaps of
> >>> different parts of the same block entry. Both of those are fairly easily
> >>> solved by using a cmpxchg to install the new table, such that if we then
> >>> find that someone else's table got there first, we can simply free ours
> >>> and continue.
> >>>
> >>> Make the requisite changes such that we can withstand being called
> >>> without the caller maintaining a lock. In theory, this opens up a few
> >>> corners in which wildly misbehaving callers making nonsensical
> >>> overlapping requests might lead to crashes instead of just unpredictable
> >>> results, but correct code really does not deserve to pay a significant
> >>> performance cost for the sake of masking bugs in theoretical broken code.
> >>>
> >>> Signed-off-by: Robin Murphy
> >>> ---
> >>>
> >>> v2:
> >>> - Fix barriers in install_table
> >>> - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> >>> - Minor cosmetics
> >>>
> >>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> >>>  1 file changed, 57 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> >>> index 6334f51912ea..52700fa958c2 100644
> >>> --- a/drivers/iommu/io-pgtable-arm.c
> >>> +++ b/drivers/iommu/io-pgtable-arm.c
> >>> @@ -20,6 +20,7 @@
> >>>
> >>>  #define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
> >>>
> >>> +#include <linux/atomic.h>
> >>>  #include <linux/iommu.h>
> >>>  #include <linux/kernel.h>
> >>>  #include <linux/sizes.h>
> >>> @@ -99,6 +100,8 @@
> >>>  #define ARM_LPAE_PTE_ATTR_HI_MASK   (((arm_lpae_iopte)6) << 52)
> >>>  #define ARM_LPAE_PTE_ATTR_MASK      (ARM_LPAE_PTE_ATTR_LO_MASK | \
> >>>                                       ARM_LPAE_PTE_ATTR_HI_MASK)
> >>> +/* Software bit for solving coherency races */
> >>> +#define ARM_LPAE_PTE_SW_SYNC        (((arm_lpae_iopte)1) << 55)
> >>>
> >>>  /* Stage-1 PTE */
> >>>  #define ARM_LPAE_PTE_AP_UNPRIV      (((arm_lpae_iopte)1) << 6)
> >>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> >>>          free_pages_exact(pages, size);
> >>>  }
> >>>
> >>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> >>> +                                struct io_pgtable_cfg *cfg)
> >>> +{
> >>> +        dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> >>> +                                   sizeof(*ptep), DMA_TO_DEVICE);
> >>> +}
> >>> +
> >>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> >>>                                 struct io_pgtable_cfg *cfg)
> >>>  {
> >>>          *ptep = pte;
> >>>
> >>>          if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> >>> -                dma_sync_single_for_device(cfg->iommu_dev,
> >>> -                                           __arm_lpae_dma_addr(ptep),
> >>> -                                           sizeof(pte), DMA_TO_DEVICE);
> >>> +                __arm_lpae_sync_pte(ptep, cfg);
> >>>  }
> >>>
> >>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> >>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> >>>
> >>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> >>>                                               arm_lpae_iopte *ptep,
> >>> +                                             arm_lpae_iopte curr,
> >>>                                               struct io_pgtable_cfg *cfg)
> >>>  {
> >>> -        arm_lpae_iopte new;
> >>> +        arm_lpae_iopte old, new;
> >>>
> >>>          new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> >>>          if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> >>>                  new |= ARM_LPAE_PTE_NSTABLE;
> >>>
> >>> -        __arm_lpae_set_pte(ptep, new, cfg);
> >>> -        return new;
> >>> +        /* Ensure the table itself is visible before its PTE can be */
> >>> +        wmb();
> >>
> >> Could you please give more hints on why this is required.
> In theory it's possible for the write to ptep to become visible before
> the previous writes filling out the PTEs in table - if the IOMMU
> happened to speculatively walk ptep while parts of table were still sat
> in a write buffer somewhere, it could see the old contents of that page
> and potentially allocate bogus TLB entries if the stale data happened to
> look like valid PTEs. In the non-coherent case the DMA cache maintenance
> is sufficient, but otherwise we need a barrier to order writing the
> next-level PTEs strictly before writing the table PTE pointing to them,
> such that the IOMMU cannot at any point see a partially-initialised table.

Got it. Thanks a lot for explaining that.
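
To make that ordering concrete, here is a small standalone C11 sketch of the
same publish-after-initialise pattern (my own illustration, not code from the
patch): one thread fills a table and then installs a pointer to it with a
release fence plus compare-and-swap, while a second thread stands in for the
IOMMU walker. All names (fake_table, slot, installer, walker) are invented
for the example, and the walker's acquire load is only a loose software model
of a hardware table walk.

/*
 * Standalone sketch (not kernel code) of the ordering pattern described
 * above. C11 atomics stand in for wmb()/cmpxchg64_relaxed(), and the
 * "walker" thread plays the role of the IOMMU walking the tables.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 4

static uint64_t fake_table[ENTRIES];       /* newly allocated next-level table */
static _Atomic(uint64_t *) slot;           /* the parent "table PTE", initially NULL */

static void *installer(void *arg)
{
        uint64_t *expected = NULL;
        int i;

        (void)arg;

        /* 1. Fully initialise the new table first... */
        for (i = 0; i < ENTRIES; i++)
                fake_table[i] = 0xAA00 + i;

        /*
         * 2. ...then order those writes before the pointer can become
         * visible. This release fence is the analogue of the wmb() in
         * the hunk quoted above.
         */
        atomic_thread_fence(memory_order_release);

        /* 3. Install the pointer only if the slot is still empty (cmpxchg). */
        atomic_compare_exchange_strong_explicit(&slot, &expected, fake_table,
                                                memory_order_relaxed,
                                                memory_order_relaxed);
        return NULL;
}

static void *walker(void *arg)
{
        uint64_t *t;
        int i;

        (void)arg;

        /* Spin until the parent slot points at a table. */
        while (!(t = atomic_load_explicit(&slot, memory_order_acquire)))
                ;

        /* Because of the fence pairing, every entry is already initialised. */
        for (i = 0; i < ENTRIES; i++)
                printf("entry %d = %#llx\n", i, (unsigned long long)t[i]);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, walker, NULL);
        pthread_create(&b, NULL, installer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

Built with something like cc -std=c11 -pthread sketch.c, the walker can never
observe a pointer to a half-initialised table; dropping the release fence
reintroduces exactly the window described above.
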
> 
> >>> +
> >>> +        old = cmpxchg64_relaxed(ptep, curr, new);
> >>> +
> >>> +        if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> >>> +            (old & ARM_LPAE_PTE_SW_SYNC))
> >>> +                return old;
> >>> +
> >>> +        /* Even if it's not ours, there's no point waiting; just kick it */
> >>> +        __arm_lpae_sync_pte(ptep, cfg);
> >>> +        if (old == curr)
> >>> +                WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> >>
> >> How is it ensured that the cache operations are completed before we flag them with
> >> ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> The wmb() here was a silly oversight on my part - as Will reminded me,
> dma_sync_*() already has its own barrier to ensure completion, which is
> pretty obvious in retrospect because the entire streaming DMA would be
> totally broken otherwise.
> 
> >>> +
> >>> +        return old;
> >>>  }
> >>>
> >>
> > 
> > 
> > Observed a performance drop of close to 1G,
> > while testing on the 10G network interface with this patch series compared to v1.
> Is that consistent over multiple runs? I wouldn't expect many workloads
> to be thrashing table and hugepage mappings in the same IOVA region, so
> after a point we should tend to reach a fairly steady state where we're
> only changing leaf PTEs.
> 
> > Moving the wmb() as in v1 restores it back.
> 
> Note that on a coherent platform like ThunderX that's as good as just
> deleting it, because you'll never execute the case below. However, on
> reflection I think it can at least safely be downgraded to dma_wmb()
> (i.e. DMB) rather than a full DSB - would you be able to test what
> difference that makes?

The testing was done on ThunderX 1, which has a non-coherent page table walk.
Yes, downgrading to dma_wmb() helps. With this change, performance is back
to v1.

Thanks.

-- 
Linu Cherian
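
Putting the quoted hunks back together, the install path with the barrier
downgraded as discussed would look roughly like this - a sketch assembled
from the diff above with wmb() swapped for dma_wmb(), not an actual posted
follow-up patch:

static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
                                             arm_lpae_iopte *ptep,
                                             arm_lpae_iopte curr,
                                             struct io_pgtable_cfg *cfg)
{
        arm_lpae_iopte old, new;

        new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
        if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
                new |= ARM_LPAE_PTE_NSTABLE;

        /* Ensure the table itself is visible before its PTE can be */
        dma_wmb();      /* was wmb(); a DMB is enough for this ordering */

        old = cmpxchg64_relaxed(ptep, curr, new);

        if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
            (old & ARM_LPAE_PTE_SW_SYNC))
                return old;

        /* Even if it's not ours, there's no point waiting; just kick it */
        __arm_lpae_sync_pte(ptep, cfg);
        if (old == curr)
                WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

        return old;
}
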