* [PATCH 1/3] iommu/arm-smmu-v3: Specialise CMD_SYNC handling
From: Robin Murphy @ 2017-08-18 17:33 UTC
To: linux-arm-kernel
CMD_SYNC already has a bit of special treatment here and there, but as
we're about to extend it with more functionality for completing outside
the CMDQ lock, things are going to get rather messy if we keep trying to
cram everything into a single generic command interface. Instead, let's
break out the issuing of CMD_SYNC into its own specific helper where
upcoming changes will have room to breathe.
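For illustration, here is the caller-side change in miniature (a sketch condensed
from the hunks below, with diff markers and context elided):

	/* Before: CMD_SYNC issued through the generic command path */
	cmd.opcode = CMDQ_OP_CMD_SYNC;
	arm_smmu_cmdq_issue_cmd(smmu, &cmd);

	/* After: the dedicated helper builds and issues the sync itself */
	arm_smmu_cmdq_issue_sync(smmu);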
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
drivers/iommu/arm-smmu-v3.c | 46 +++++++++++++++++++++++++++++----------------
1 file changed, 30 insertions(+), 16 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index e67ba6c40faf..bbcb33e8f55c 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -936,13 +936,23 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
}
+static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd,
+ bool wfe)
+{
+ struct arm_smmu_queue *q = &smmu->cmdq.q;
+
+ while (queue_insert_raw(q, cmd) == -ENOSPC) {
+ if (queue_poll_cons(q, false, wfe))
+ dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+ }
+}
+
static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_ent *ent)
{
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
- struct arm_smmu_queue *q = &smmu->cmdq.q;
if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
@@ -951,12 +961,22 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
}
spin_lock_irqsave(&smmu->cmdq.lock, flags);
- while (queue_insert_raw(q, cmd) == -ENOSPC) {
- if (queue_poll_cons(q, false, wfe))
- dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
- }
+ arm_smmu_cmdq_insert_cmd(smmu, cmd, wfe);
+ spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
+}
- if (ent->opcode == CMDQ_OP_CMD_SYNC && queue_poll_cons(q, true, wfe))
+static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
+{
+ u64 cmd[CMDQ_ENT_DWORDS];
+ unsigned long flags;
+ bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
+
+ arm_smmu_cmdq_build_cmd(cmd, &ent);
+
+ spin_lock_irqsave(&smmu->cmdq.lock, flags);
+ arm_smmu_cmdq_insert_cmd(smmu, cmd, wfe);
+ if (queue_poll_cons(&smmu->cmdq.q, true, wfe))
dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
}
@@ -1029,8 +1049,7 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
};
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
}
static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
@@ -1352,10 +1371,7 @@ static irqreturn_t arm_smmu_combined_irq_handler(int irq, void *dev)
/* IO_PGTABLE API */
static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
{
- struct arm_smmu_cmdq_ent cmd;
-
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
}
static void arm_smmu_tlb_sync(void *cookie)
@@ -2399,8 +2415,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
/* Invalidate any cached configuration */
cmd.opcode = CMDQ_OP_CFGI_ALL;
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
/* Invalidate any stale TLB entries */
if (smmu->features & ARM_SMMU_FEAT_HYP) {
@@ -2410,8 +2425,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
cmd.opcode = CMDQ_OP_TLBI_NSNH_ALL;
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
/* Event queue */
writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
--
2.13.4.dirty
* [PATCH 2/3] iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
From: Robin Murphy @ 2017-08-18 17:33 UTC
To: linux-arm-kernel
The cmdq-sync interrupt is never going to be particularly useful, since
for stage 1 DMA at least we'll often need to wait for sync completion
within someone else's IRQ handler, and thus have to implement polling
anyway. Beyond that, the overhead of taking an interrupt, then still
having to grovel around in the queue to figure out *which* sync command
completed, doesn't seem much more attractive than simple polling either.
Furthermore, if an implementation both has wired interrupts and supports
MSIs, then we don't want to be taking the IRQ unnecessarily if we're
using the MSI write to update memory. Let's just make life simpler by
not even bothering to claim it in the first place.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
drivers/iommu/arm-smmu-v3.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index bbcb33e8f55c..6fbc2e59f7c1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1294,12 +1294,6 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
return IRQ_HANDLED;
}
-static irqreturn_t arm_smmu_cmdq_sync_handler(int irq, void *dev)
-{
- /* We don't actually use CMD_SYNC interrupts for anything */
- return IRQ_HANDLED;
-}
-
static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
@@ -1332,10 +1326,8 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
if (active & GERROR_MSI_EVTQ_ABT_ERR)
dev_warn(smmu->dev, "EVTQ MSI write aborted\n");
- if (active & GERROR_MSI_CMDQ_ABT_ERR) {
+ if (active & GERROR_MSI_CMDQ_ABT_ERR)
dev_warn(smmu->dev, "CMDQ MSI write aborted\n");
- arm_smmu_cmdq_sync_handler(irq, smmu->dev);
- }
if (active & GERROR_PRIQ_ABT_ERR)
dev_err(smmu->dev, "PRIQ write aborted -- events may have been lost\n");
@@ -1364,7 +1356,6 @@ static irqreturn_t arm_smmu_combined_irq_thread(int irq, void *dev)
static irqreturn_t arm_smmu_combined_irq_handler(int irq, void *dev)
{
arm_smmu_gerror_handler(irq, dev);
- arm_smmu_cmdq_sync_handler(irq, dev);
return IRQ_WAKE_THREAD;
}
@@ -2281,15 +2272,6 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
dev_warn(smmu->dev, "failed to enable evtq irq\n");
}
- irq = smmu->cmdq.q.irq;
- if (irq) {
- ret = devm_request_irq(smmu->dev, irq,
- arm_smmu_cmdq_sync_handler, 0,
- "arm-smmu-v3-cmdq-sync", smmu);
- if (ret < 0)
- dev_warn(smmu->dev, "failed to enable cmdq-sync irq\n");
- }
-
irq = smmu->gerr_irq;
if (irq) {
ret = devm_request_irq(smmu->dev, irq, arm_smmu_gerror_handler,
@@ -2797,10 +2779,6 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
if (irq > 0)
smmu->priq.q.irq = irq;
- irq = platform_get_irq_byname(pdev, "cmdq-sync");
- if (irq > 0)
- smmu->cmdq.q.irq = irq;
-
irq = platform_get_irq_byname(pdev, "gerror");
if (irq > 0)
smmu->gerr_irq = irq;
--
2.13.4.dirty
* [PATCH 3/3] iommu/arm-smmu-v3: Utilise CMD_SYNC MSI feature
From: Robin Murphy @ 2017-08-18 17:33 UTC
To: linux-arm-kernel
As an IRQ, the CMD_SYNC interrupt is not particularly useful, not least
because we often need to wait for sync completion within someone else's
IRQ handler anyway. However, when MSIs and coherent accesses are
supported, we can have a lot more fun by not using it as an interrupt at
all. Following the example suggested in the architecture and using a
coherent write targeting memory, we can let waiters poll a status
variable outside the lock instead of having to stall the entire queue.
Furthermore, we can then take advantage of the exclusive monitor to
optimise the polling too. Since multiple sync commands are guaranteed to
complete in order, a simple incrementing sequence count is all we need
to unambiguously support overlapping waiters.
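In miniature, the scheme looks like this (a sketch condensed from the hunks
below; locking, the non-MSI fallback and error handling are omitted):

	u32 nr = atomic_inc_return(&smmu->sync_nr);	/* per-sync sequence number */

	ent.sync.msi     = true;
	ent.sync.msidata = nr;				/* MSI payload = sequence number */
	ent.sync.msiaddr = virt_to_phys(&smmu->sync_done);
	/* ...build the command and insert it under cmdq.lock as usual... */

	/*
	 * Completion: the SMMU's coherent write publishes the latest completed
	 * sequence number in sync_done. The signed difference keeps the
	 * comparison correct across u32 wraparound, so a later sync completing
	 * also satisfies earlier waiters. The actual patch additionally bounds
	 * this wait with a ktime-based timeout.
	 */
	smp_cond_load_acquire(&smmu->sync_done, (int)(VAL - nr) >= 0);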
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
drivers/iommu/arm-smmu-v3.c | 51 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 48 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 6fbc2e59f7c1..a62ff5290c52 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -377,7 +377,16 @@
#define CMDQ_SYNC_0_CS_SHIFT 12
#define CMDQ_SYNC_0_CS_NONE (0UL << CMDQ_SYNC_0_CS_SHIFT)
+#define CMDQ_SYNC_0_CS_IRQ (1UL << CMDQ_SYNC_0_CS_SHIFT)
#define CMDQ_SYNC_0_CS_SEV (2UL << CMDQ_SYNC_0_CS_SHIFT)
+#define CMDQ_SYNC_0_MSH_SHIFT 22
+#define CMDQ_SYNC_0_MSH_ISH (3UL << CMDQ_SYNC_0_MSH_SHIFT)
+#define CMDQ_SYNC_0_MSIATTR_SHIFT 24
+#define CMDQ_SYNC_0_MSIATTR_OIWB (0xfUL << CMDQ_SYNC_0_MSIATTR_SHIFT)
+#define CMDQ_SYNC_0_MSIDATA_SHIFT 32
+#define CMDQ_SYNC_0_MSIDATA_MASK 0xffffffffUL
+#define CMDQ_SYNC_1_MSIADDR_SHIFT 0
+#define CMDQ_SYNC_1_MSIADDR_MASK 0xffffffffffffcUL
/* Event queue */
#define EVTQ_ENT_DWORDS 4
@@ -504,6 +513,11 @@ struct arm_smmu_cmdq_ent {
} pri;
#define CMDQ_OP_CMD_SYNC 0x46
+ struct {
+ bool msi;
+ u32 msidata;
+ u64 msiaddr;
+ } sync;
};
};
@@ -617,6 +631,9 @@ struct arm_smmu_device {
int gerr_irq;
int combined_irq;
+ atomic_t sync_nr;
+ u32 sync_done;
+
unsigned long ias; /* IPA */
unsigned long oas; /* PA */
unsigned long pgsize_bitmap;
@@ -878,7 +895,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
}
break;
case CMDQ_OP_CMD_SYNC:
- cmd[0] |= CMDQ_SYNC_0_CS_SEV;
+ if (ent->sync.msi)
+ cmd[0] |= CMDQ_SYNC_0_CS_IRQ;
+ else
+ cmd[0] |= CMDQ_SYNC_0_CS_SEV;
+ cmd[0] |= CMDQ_SYNC_0_MSH_ISH | CMDQ_SYNC_0_MSIATTR_OIWB;
+ cmd[0] |= (u64)ent->sync.msidata << CMDQ_SYNC_0_MSIDATA_SHIFT;
+ cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
break;
default:
return -ENOENT;
@@ -965,20 +988,41 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
}
+static int arm_smmu_cmdq_poll_sync(struct arm_smmu_device *smmu, u32 nr)
+{
+ ktime_t timeout = ktime_add_us(ktime_get(),
+ ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US);
+ u32 val = smp_cond_load_acquire(&smmu->sync_done,
+ (int)(VAL - nr) >= 0 ||
+ !ktime_before(ktime_get(), timeout));
+
+ return (int)(val - nr) < 0 ? -ETIMEDOUT : 0;
+}
+
static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
{
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
+ (smmu->features & ARM_SMMU_FEAT_COHERENCY);
struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
- arm_smmu_cmdq_build_cmd(cmd, &ent);
+ if (msi) {
+ ent.sync.msidata = atomic_inc_return(&smmu->sync_nr);
+ ent.sync.msiaddr = virt_to_phys(&smmu->sync_done);
+ ent.sync.msi = true;
+ }
+ arm_smmu_cmdq_build_cmd(cmd, &ent);
spin_lock_irqsave(&smmu->cmdq.lock, flags);
arm_smmu_cmdq_insert_cmd(smmu, cmd, wfe);
- if (queue_poll_cons(&smmu->cmdq.q, true, wfe))
+ if (!msi && queue_poll_cons(&smmu->cmdq.q, true, wfe))
dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
+
+ if (msi && arm_smmu_cmdq_poll_sync(smmu, ent.sync.msidata))
+ dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
}
/* Context descriptor manipulation functions */
@@ -2154,6 +2198,7 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
{
int ret;
+ atomic_set(&smmu->sync_nr, 0);
ret = arm_smmu_init_queues(smmu);
if (ret)
return ret;
--
2.13.4.dirty
* [PATCH 0/3] SMMUv3 CMD_SYNC optimisation
From: Nate Watterson @ 2017-08-25 4:44 UTC
To: linux-arm-kernel
Hi Robin,
On 8/18/2017 1:33 PM, Robin Murphy wrote:
> Hi all,
>
> Waiting for the command queue to drain for CMD_SYNC completion is likely
> a contention hotspot on high-core-count systems. If the SMMU is coherent
> and supports MSIs, though, we can use this cool feature (as suggested by
> the architecture, no less) to make syncs effectively non-blocking for
> anyone other than the caller.
>
> I don't have any hardware that supports MSIs, but this has at least
> passed muster on the Fast Model with cache modelling enabled - I'm hoping
> the Qualcomm machines have the appropriate configuration to actually test
> how well it works in reality. If it is worthwhile, I do have most of a
> plan for how we can do something similar in the non-MSI polling case (it's
> mostly a problem of handling the queue-wrapping edge cases correctly).
I tested this on QDF2400 hardware, which supports MSI as a CMD_SYNC
completion signal. As with Thunder's "performance optimization" series,
I evaluated the patches using FIO with 4 NVMe drives connected to a
single SMMU. Here's how they compared:
FIO - 512k blocksize / io-depth 32 / 1 thread per drive
Baseline 4.13-rc1 w/SMMU enabled: 25% of SMMU bypass performance
Baseline + Thunder Patch 1 : 28%
Baseline + CMD_SYNC Optimization: 36%
Baseline + Thunder Patches 2-5 : 86%
Baseline + Thunder Patches 1-5 : 100% [!!]
Seems like it would probably be worthwhile to implement this for the
non-MSI case also. Let me know if there are other workloads you're
particularly interested in, and I'll try to get those tested too.
-Nate
>
> Robin.
>
>
> Robin Murphy (3):
> iommu/arm-smmu-v3: Specialise CMD_SYNC handling
> iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
> iommu/arm-smmu-v3: Utilise CMD_SYNC MSI feature
>
> drivers/iommu/arm-smmu-v3.c | 117 +++++++++++++++++++++++++++++---------------
> 1 file changed, 77 insertions(+), 40 deletions(-)
>
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.