Date: Fri, 23 Jan 2026 09:48:37 +0000
From: Pranjal Shrivastava
To: Nicolin Chen
Cc: will@kernel.org, jean-philippe@linaro.org, robin.murphy@arm.com,
	joro@8bytes.org, jgg@nvidia.com, balbirs@nvidia.com,
	miko.lenczewski@arm.com, peterz@infradead.org, kevin.tian@intel.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
References: <06999367d001283744fd98eb7c1823afd516ce84.1766174731.git.nicolinc@nvidia.com>
In-Reply-To: <06999367d001283744fd98eb7c1823afd516ce84.1766174731.git.nicolinc@nvidia.com>

On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> Each smmu_domain now has an arm_smmu_invs that specifies the invalidation
> steps to perform after any change to the IOPTEs. This includes support
> for basic ASID/VMID, the special case for nesting, and ATC invalidations.
>
> Introduce a new arm_smmu_domain_inv helper iterating smmu_domain->invs to
> convert the invalidation array to commands. Any invalidation request with
> no size specified means an entire flush instead of a range-based one.
>
> Take advantage of the sorted array to batch compatible operations to the
> same SMMU together. For instance, ATC invalidations for multiple SIDs can
> be pushed as a batch.
>
> ATC invalidations must be completed before the driver disables ATS,
> otherwise the device is permitted to ignore any racing invalidation,
> which would cause an SMMU timeout. The sequencing is done with a rwlock,
> where holding the write side of the rwlock means that there are no
> outstanding ATC invalidations. If ATS is not used the rwlock is ignored,
> similar to the existing code.
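Just to check that I'm reading the sequencing in the last paragraph
correctly: the write side of the rwlock is only ever taken as a barrier on
the detach/ATS-disable path, i.e. roughly (my reading, not code from this
patch):

	write_lock(&invs->rwlock);
	write_unlock(&invs->rwlock);

so that once that returns, any concurrent reader that was still pushing
ATC invalidations for this domain is guaranteed to have finished, and only
then is ATS disabled for the device?
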
> > Co-developed-by: Jason Gunthorpe > Signed-off-by: Jason Gunthorpe > Reviewed-by: Jason Gunthorpe > Signed-off-by: Nicolin Chen > --- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 + > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 +++++++++++++++++++- > 2 files changed, 254 insertions(+), 13 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h > index f8dc96476c43..c3fee7f14480 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h > @@ -1086,6 +1086,15 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid, > int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, > unsigned long iova, size_t size); > > +void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain, > + unsigned long iova, size_t size, > + unsigned int granule, bool leaf); > + > +static inline void arm_smmu_domain_inv(struct arm_smmu_domain *smmu_domain) > +{ > + arm_smmu_domain_inv_range(smmu_domain, 0, 0, 0, false); > +} > + > void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu, > struct arm_smmu_cmdq *cmdq); > int arm_smmu_init_one_queue(struct arm_smmu_device *smmu, > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index fb45359680d2..6e1082e6d164 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -2516,23 +2516,19 @@ static void arm_smmu_tlb_inv_context(void *cookie) > arm_smmu_atc_inv_domain(smmu_domain, 0, 0); > } > > -static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, > - unsigned long iova, size_t size, > - size_t granule, > - struct arm_smmu_domain *smmu_domain) > +static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu, > + struct arm_smmu_cmdq_batch *cmds, > + struct arm_smmu_cmdq_ent *cmd, > + unsigned long iova, size_t size, > + size_t granule, size_t pgsize) > { > - struct arm_smmu_device *smmu = smmu_domain->smmu; > - unsigned long end = iova + size, num_pages = 0, tg = 0; > + unsigned long end = iova + size, num_pages = 0, tg = pgsize; > size_t inv_range = granule; > - struct arm_smmu_cmdq_batch cmds; > > if (!size) > return; > > if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { > - /* Get the leaf page size */ > - tg = __ffs(smmu_domain->domain.pgsize_bitmap); > - > num_pages = size >> tg; > > /* Convert page size of 12,14,16 (log2) to 1,2,3 */ > @@ -2552,8 +2548,6 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, > num_pages++; > } > > - arm_smmu_cmdq_batch_init(smmu, &cmds, cmd); > - > while (iova < end) { > if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { > /* > @@ -2581,9 +2575,26 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, > } > > cmd->tlbi.addr = iova; > - arm_smmu_cmdq_batch_add(smmu, &cmds, cmd); > + arm_smmu_cmdq_batch_add(smmu, cmds, cmd); > iova += inv_range; > } > +} > + > +static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, > + unsigned long iova, size_t size, > + size_t granule, > + struct arm_smmu_domain *smmu_domain) > +{ > + struct arm_smmu_device *smmu = smmu_domain->smmu; > + struct arm_smmu_cmdq_batch cmds; > + size_t pgsize; > + > + /* Get the leaf page size */ > + pgsize = __ffs(smmu_domain->domain.pgsize_bitmap); > + > + arm_smmu_cmdq_batch_init(smmu, &cmds, cmd); > + arm_smmu_cmdq_batch_add_range(smmu, &cmds, cmd, iova, size, granule, > + pgsize); > arm_smmu_cmdq_batch_submit(smmu, &cmds); > } > 
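(Side note, just to confirm my reading of the new entry points: I assume a
later patch in the series converts the iotlb_sync path to something along
the lines of

	arm_smmu_domain_inv_range(smmu_domain, gather->start,
				  gather->end - gather->start + 1,
				  gather->pgsize, true);

while the size == 0 arm_smmu_domain_inv() wrapper is reserved for the
"invalidate everything" paths, since arm_smmu_inv_size_too_big() treats a
zero size as a full flush.)
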
> @@ -2639,6 +2650,193 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid, > __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain); > } > > +static bool arm_smmu_inv_size_too_big(struct arm_smmu_device *smmu, size_t size, > + size_t granule) > +{ > + size_t max_tlbi_ops; > + > + /* 0 size means invalidate all */ > + if (!size || size == SIZE_MAX) > + return true; > + > + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) > + return false; > + > + /* > + * Borrowed from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h, > + * this is used as a threshold to replace "size_opcode" commands with a > + * single "nsize_opcode" command, when SMMU doesn't implement the range > + * invalidation feature, where there can be too many per-granule TLBIs, > + * resulting in a soft lockup. > + */ > + max_tlbi_ops = 1 << (ilog2(granule) - 3); > + return size >= max_tlbi_ops * granule; > +} > + > +/* Used by non INV_TYPE_ATS* invalidations */ > +static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv, > + struct arm_smmu_cmdq_batch *cmds, > + struct arm_smmu_cmdq_ent *cmd, > + unsigned long iova, size_t size, > + unsigned int granule) > +{ > + if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) { > + cmd->opcode = inv->nsize_opcode; > + arm_smmu_cmdq_batch_add(inv->smmu, cmds, cmd); > + return; > + } > + > + cmd->opcode = inv->size_opcode; > + arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule, > + inv->pgsize); > +} > + > +static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur, > + struct arm_smmu_inv *next) > +{ > + /* Changing smmu means changing command queue */ > + if (cur->smmu != next->smmu) > + return true; > + /* The batch for S2 TLBI must be done before nested S1 ASIDs */ > + if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR && > + next->type == INV_TYPE_S2_VMID_S1_CLEAR) > + return true; > + /* ATS must be after a sync of the S1/S2 invalidations */ > + if (!arm_smmu_inv_is_ats(cur) && arm_smmu_inv_is_ats(next)) > + return true; > + return false; > +} > + > +static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs, > + unsigned long iova, size_t size, > + unsigned int granule, bool leaf) > +{ > + struct arm_smmu_cmdq_batch cmds = {}; > + struct arm_smmu_inv *cur; > + struct arm_smmu_inv *end; > + > + cur = invs->inv; > + end = cur + READ_ONCE(invs->num_invs); > + /* Skip any leading entry marked as a trash */ > + for (; cur != end; cur++) > + if (refcount_read(&cur->users)) > + break; > + while (cur != end) { > + struct arm_smmu_device *smmu = cur->smmu; > + struct arm_smmu_cmdq_ent cmd = { > + /* > + * Pick size_opcode to run arm_smmu_get_cmdq(). This can > + * be changed to nsize_opcode, which would result in the > + * same CMDQ pointer. 
> + */ > + .opcode = cur->size_opcode, > + }; > + struct arm_smmu_inv *next; > + > + if (!cmds.num) > + arm_smmu_cmdq_batch_init(smmu, &cmds, &cmd); > + > + switch (cur->type) { > + case INV_TYPE_S1_ASID: > + cmd.tlbi.asid = cur->id; > + cmd.tlbi.leaf = leaf; > + arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size, > + granule); > + break; > + case INV_TYPE_S2_VMID: > + cmd.tlbi.vmid = cur->id; > + cmd.tlbi.leaf = leaf; > + arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size, > + granule); > + break; > + case INV_TYPE_S2_VMID_S1_CLEAR: > + /* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */ > + if (arm_smmu_inv_size_too_big(cur->smmu, size, granule)) > + continue; > + cmd.tlbi.vmid = cur->id; > + arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd); > + break; > + case INV_TYPE_ATS: > + arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd); > + cmd.atc.sid = cur->id; > + arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd); > + break; > + case INV_TYPE_ATS_FULL: > + arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd); > + cmd.atc.sid = cur->id; > + arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd); > + break; > + default: > + WARN_ON_ONCE(1); > + continue; > + } > + > + /* Skip any trash entry in-between */ > + for (next = cur + 1; next != end; next++) > + if (refcount_read(&next->users)) > + break; > + > + if (cmds.num && > + (next == end || arm_smmu_invs_end_batch(cur, next))) { > + arm_smmu_cmdq_batch_submit(smmu, &cmds); > + cmds.num = 0; > + } > + cur = next; > + } > +} > + > +void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain, > + unsigned long iova, size_t size, > + unsigned int granule, bool leaf) > +{ > + struct arm_smmu_invs *invs; > + > + /* > + * An invalidation request must follow some IOPTE change and then load > + * an invalidation array. In the meantime, a domain attachment mutates > + * the array and then stores an STE/CD asking SMMU HW to acquire those > + * changed IOPTEs. In other word, these two are interdependent and can > + * race. > + * > + * In a race, the RCU design (with its underlying memory barriers) can > + * ensure the invalidation array to always get updated before loaded. > + * > + * smp_mb() is used here, paired with the smp_mb() following the array > + * update in a concurrent attach, to ensure: > + * - HW sees the new IOPTEs if it walks after STE installation > + * - Invalidation thread sees the updated array with the new ASID. > + * > + * [CPU0] | [CPU1] > + * | > + * change IOPTEs and TLB flush: | > + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs { > + * ... | rcu_assign_pointer(new_invs); > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs > + * ... | kfree_rcu(old_invs, rcu); > + * // load invalidation array | } > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev { > + * | STE = TTB0 // read new IOPTEs > + */ > + smp_mb(); > + > + rcu_read_lock(); > + invs = rcu_dereference(smmu_domain->invs); > + > + /* > + * Avoid locking unless ATS is being used. No ATC invalidation can be > + * going on after a domain is detached. > + */ > + if (invs->has_ats) { > + read_lock(&invs->rwlock); Shouldn't these be read_lock_irqsave for all rwlock variants here? Invalidations might happen in IRQ context as well.. 
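To spell out what I mean, something like this (untested, only to
illustrate the variant):

	unsigned long flags;
	...
	if (invs->has_ats) {
		read_lock_irqsave(&invs->rwlock, flags);
		__arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
		read_unlock_irqrestore(&invs->rwlock, flags);
	} else {
		__arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
	}

and presumably the matching irqsave variant on the write side, wherever
that ends up being taken.
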
> + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf); > + read_unlock(&invs->rwlock); > + } else { > + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf); > + } > + > + rcu_read_unlock(); > +} > + > static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather, > unsigned long iova, size_t granule, > void *cookie) > @@ -3285,6 +3483,23 @@ arm_smmu_install_new_domain_invs(struct arm_smmu_attach_state *state) > return; > > rcu_assign_pointer(*invst->invs_ptr, invst->new_invs); > + /* > + * We are committed to updating the STE. Ensure the invalidation array > + * is visible to concurrent map/unmap threads, and acquire any racing > + * IOPTE updates. > + * > + * [CPU0] | [CPU1] > + * | > + * change IOPTEs and TLB flush: | > + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs { > + * ... | rcu_assign_pointer(new_invs); > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs > + * ... | kfree_rcu(old_invs, rcu); > + * // load invalidation array | } > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev { > + * | STE = TTB0 // read new IOPTEs > + */ > + smp_mb(); > kfree_rcu(invst->old_invs, rcu); > } > > @@ -3334,6 +3549,23 @@ arm_smmu_install_old_domain_invs(struct arm_smmu_attach_state *state) > return; > > rcu_assign_pointer(*invst->invs_ptr, new_invs); > + /* > + * We are committed to updating the STE. Ensure the invalidation array > + * is visible to concurrent map/unmap threads, and acquire any racing > + * IOPTE updates. > + * > + * [CPU0] | [CPU1] > + * | > + * change IOPTEs and TLB flush: | > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs { > + * ... | rcu_assign_pointer(new_invs); > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs > + * ... | kfree_rcu(old_invs, rcu); > + * // load invalidation array | } > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev { > + * | STE = TTB0 // read new IOPTEs > + */ > + smp_mb(); > kfree_rcu(old_invs, rcu); > } > For INV_TYPE_S1_ASID, the new code loops and checks size_too_big via arm_smmu_inv_to_cmdq_batch. However, for INV_TYPE_ATS, it issues a single command for the entire range. While this matches the current driver, are we confident arm_smmu_atc_inv_to_cmd handles all massive sizes correctly without needing a similar loop or "too big" fallback? Thanks, Praan
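
P.S. To make the last question concrete, what I was wondering is whether
the ATC path wants an equivalent of the nsize_opcode fallback, roughly
(hypothetical sketch, ATC_MAX_RANGE_OPS is made up for illustration):

	/* size == 0 tells arm_smmu_atc_inv_to_cmd() to build a full ATC inv */
	if (size >= ATC_MAX_RANGE_OPS * granule)
		arm_smmu_atc_inv_to_cmd(cur->ssid, 0, 0, &cmd);
	else
		arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);

or whether a single range-encoded ATC_INV is always considered cheap
enough that it never needs to be clamped.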