* [PATCH v2 1/4] iommu/arm-smmu-v3: Specialise CMD_SYNC handling
From: Robin Murphy @ 2017-08-31 13:44 UTC (permalink / raw)
To: linux-arm-kernel
CMD_SYNC already has a bit of special treatment here and there, but as
we're about to extend it with more functionality for completing outside
the CMDQ lock, things are going to get rather messy if we keep trying to
cram everything into a single generic command interface. Instead, let's
break out the issuing of CMD_SYNC into its own specific helper where
upcoming changes will have room to breathe.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
v2: Cosmetic changes to reduce churn in later patches
drivers/iommu/arm-smmu-v3.c | 54 +++++++++++++++++++++++++++++----------------
1 file changed, 35 insertions(+), 19 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 568c400eeaed..91f28aca72c0 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -936,13 +936,22 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
}
+static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
+{
+ struct arm_smmu_queue *q = &smmu->cmdq.q;
+ bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+
+ while (queue_insert_raw(q, cmd) == -ENOSPC) {
+ if (queue_poll_cons(q, false, wfe))
+ dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+ }
+}
+
static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_ent *ent)
{
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
- bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
- struct arm_smmu_queue *q = &smmu->cmdq.q;
if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
@@ -951,16 +960,29 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
}
spin_lock_irqsave(&smmu->cmdq.lock, flags);
- while (queue_insert_raw(q, cmd) == -ENOSPC) {
- if (queue_poll_cons(q, false, wfe))
- dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
- }
-
- if (ent->opcode == CMDQ_OP_CMD_SYNC && queue_poll_cons(q, true, wfe))
- dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
+ arm_smmu_cmdq_insert_cmd(smmu, cmd);
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
}
+static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
+{
+ u64 cmd[CMDQ_ENT_DWORDS];
+ unsigned long flags;
+ bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
+ int ret;
+
+ arm_smmu_cmdq_build_cmd(cmd, &ent);
+
+ spin_lock_irqsave(&smmu->cmdq.lock, flags);
+ arm_smmu_cmdq_insert_cmd(smmu, cmd);
+ ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
+ spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
+
+ if (ret)
+ dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
+}
+
/* Context descriptor manipulation functions */
static u64 arm_smmu_cpu_tcr_to_cd(u64 tcr)
{
@@ -1029,8 +1051,7 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
};
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
}
static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
@@ -1352,10 +1373,7 @@ static irqreturn_t arm_smmu_combined_irq_handler(int irq, void *dev)
/* IO_PGTABLE API */
static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
{
- struct arm_smmu_cmdq_ent cmd;
-
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
}
static void arm_smmu_tlb_sync(void *cookie)
@@ -2399,8 +2417,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
/* Invalidate any cached configuration */
cmd.opcode = CMDQ_OP_CFGI_ALL;
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
/* Invalidate any stale TLB entries */
if (smmu->features & ARM_SMMU_FEAT_HYP) {
@@ -2410,8 +2427,7 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
cmd.opcode = CMDQ_OP_TLBI_NSNH_ALL;
arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- cmd.opcode = CMDQ_OP_CMD_SYNC;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_sync(smmu);
/* Event queue */
writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
--
2.13.4.dirty
* [PATCH v2 2/4] iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
From: Robin Murphy @ 2017-08-31 13:44 UTC (permalink / raw)
To: linux-arm-kernel
The cmdq-sync interrupt is never going to be particularly useful, since
for stage 1 DMA at least we'll often need to wait for sync completion
within someone else's IRQ handler, thus have to implement polling
anyway. Beyond that, the overhead of taking an interrupt, then still
having to grovel around in the queue to figure out *which* sync command
completed, doesn't seem much more attractive than simple polling either.
Furthermore, if an implementation both has wired interrupts and supports
MSIs, then we don't want to be taking the IRQ unnecessarily if we're
using the MSI write to update memory. Let's just make life simpler by
not even bothering to claim it in the first place.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
v2: No change
drivers/iommu/arm-smmu-v3.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 91f28aca72c0..f066725298cd 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1296,12 +1296,6 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
return IRQ_HANDLED;
}
-static irqreturn_t arm_smmu_cmdq_sync_handler(int irq, void *dev)
-{
- /* We don't actually use CMD_SYNC interrupts for anything */
- return IRQ_HANDLED;
-}
-
static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
@@ -1334,10 +1328,8 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
if (active & GERROR_MSI_EVTQ_ABT_ERR)
dev_warn(smmu->dev, "EVTQ MSI write aborted\n");
- if (active & GERROR_MSI_CMDQ_ABT_ERR) {
+ if (active & GERROR_MSI_CMDQ_ABT_ERR)
dev_warn(smmu->dev, "CMDQ MSI write aborted\n");
- arm_smmu_cmdq_sync_handler(irq, smmu->dev);
- }
if (active & GERROR_PRIQ_ABT_ERR)
dev_err(smmu->dev, "PRIQ write aborted -- events may have been lost\n");
@@ -1366,7 +1358,6 @@ static irqreturn_t arm_smmu_combined_irq_thread(int irq, void *dev)
static irqreturn_t arm_smmu_combined_irq_handler(int irq, void *dev)
{
arm_smmu_gerror_handler(irq, dev);
- arm_smmu_cmdq_sync_handler(irq, dev);
return IRQ_WAKE_THREAD;
}
@@ -2283,15 +2274,6 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
dev_warn(smmu->dev, "failed to enable evtq irq\n");
}
- irq = smmu->cmdq.q.irq;
- if (irq) {
- ret = devm_request_irq(smmu->dev, irq,
- arm_smmu_cmdq_sync_handler, 0,
- "arm-smmu-v3-cmdq-sync", smmu);
- if (ret < 0)
- dev_warn(smmu->dev, "failed to enable cmdq-sync irq\n");
- }
-
irq = smmu->gerr_irq;
if (irq) {
ret = devm_request_irq(smmu->dev, irq, arm_smmu_gerror_handler,
@@ -2799,10 +2781,6 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
if (irq > 0)
smmu->priq.q.irq = irq;
- irq = platform_get_irq_byname(pdev, "cmdq-sync");
- if (irq > 0)
- smmu->cmdq.q.irq = irq;
-
irq = platform_get_irq_byname(pdev, "gerror");
if (irq > 0)
smmu->gerr_irq = irq;
--
2.13.4.dirty
* [PATCH v2 3/4] iommu/arm-smmu-v3: Use CMD_SYNC completion MSI
From: Robin Murphy @ 2017-08-31 13:44 UTC (permalink / raw)
To: linux-arm-kernel
As an IRQ, the CMD_SYNC interrupt is not particularly useful, not least
because we often need to wait for sync completion within someone else's
IRQ handler anyway. However, when the SMMU is both coherent and supports
MSIs, we can have a lot more fun by not using it as an interrupt at all.
Following the example suggested in the architecture and using a write
targeting normal memory, we can let callers wait on a status variable
outside the lock instead of having to stall the entire queue or even
touch MMIO registers. Since multiple sync commands are guaranteed to
complete in order, a simple incrementing sequence count is all we need
to unambiguously support any realistic number of overlapping waiters.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
v2: Remove redundant 'bool msi' command member, other cosmetic tweaks
drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 45 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f066725298cd..311f482b93d5 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -377,7 +377,16 @@
#define CMDQ_SYNC_0_CS_SHIFT 12
#define CMDQ_SYNC_0_CS_NONE (0UL << CMDQ_SYNC_0_CS_SHIFT)
+#define CMDQ_SYNC_0_CS_IRQ (1UL << CMDQ_SYNC_0_CS_SHIFT)
#define CMDQ_SYNC_0_CS_SEV (2UL << CMDQ_SYNC_0_CS_SHIFT)
+#define CMDQ_SYNC_0_MSH_SHIFT 22
+#define CMDQ_SYNC_0_MSH_ISH (3UL << CMDQ_SYNC_0_MSH_SHIFT)
+#define CMDQ_SYNC_0_MSIATTR_SHIFT 24
+#define CMDQ_SYNC_0_MSIATTR_OIWB (0xfUL << CMDQ_SYNC_0_MSIATTR_SHIFT)
+#define CMDQ_SYNC_0_MSIDATA_SHIFT 32
+#define CMDQ_SYNC_0_MSIDATA_MASK 0xffffffffUL
+#define CMDQ_SYNC_1_MSIADDR_SHIFT 0
+#define CMDQ_SYNC_1_MSIADDR_MASK 0xffffffffffffcUL
/* Event queue */
#define EVTQ_ENT_DWORDS 4
@@ -409,6 +418,7 @@
/* High-level queue structures */
#define ARM_SMMU_POLL_TIMEOUT_US 100
#define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US 1000000 /* 1s! */
+#define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
#define MSI_IOVA_BASE 0x8000000
#define MSI_IOVA_LENGTH 0x100000
@@ -504,6 +514,10 @@ struct arm_smmu_cmdq_ent {
} pri;
#define CMDQ_OP_CMD_SYNC 0x46
+ struct {
+ u32 msidata;
+ u64 msiaddr;
+ } sync;
};
};
@@ -617,6 +631,9 @@ struct arm_smmu_device {
int gerr_irq;
int combined_irq;
+ atomic_t sync_nr;
+ u32 sync_count;
+
unsigned long ias; /* IPA */
unsigned long oas; /* PA */
unsigned long pgsize_bitmap;
@@ -878,7 +895,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
}
break;
case CMDQ_OP_CMD_SYNC:
- cmd[0] |= CMDQ_SYNC_0_CS_SEV;
+ if (ent->sync.msiaddr)
+ cmd[0] |= CMDQ_SYNC_0_CS_IRQ;
+ else
+ cmd[0] |= CMDQ_SYNC_0_CS_SEV;
+ cmd[0] |= CMDQ_SYNC_0_MSH_ISH | CMDQ_SYNC_0_MSIATTR_OIWB;
+ cmd[0] |= (u64)ent->sync.msidata << CMDQ_SYNC_0_MSIDATA_SHIFT;
+ cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
break;
default:
return -ENOENT;
@@ -964,21 +987,40 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
}
+static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
+{
+ ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
+ u32 val = smp_cond_load_acquire(&smmu->sync_count,
+ (int)(VAL - sync_idx) >= 0 ||
+ !ktime_before(ktime_get(), timeout));
+
+ return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
+}
+
static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
{
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
+ (smmu->features & ARM_SMMU_FEAT_COHERENCY);
struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
int ret;
+ if (msi) {
+ ent.sync.msidata = atomic_inc_return(&smmu->sync_nr);
+ ent.sync.msiaddr = virt_to_phys(&smmu->sync_count);
+ }
arm_smmu_cmdq_build_cmd(cmd, &ent);
spin_lock_irqsave(&smmu->cmdq.lock, flags);
arm_smmu_cmdq_insert_cmd(smmu, cmd);
- ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
+ if (!msi)
+ ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
+ if (msi)
+ ret = arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
if (ret)
dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
}
@@ -2156,6 +2198,7 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
{
int ret;
+ atomic_set(&smmu->sync_nr, 0);
ret = arm_smmu_init_queues(smmu);
if (ret)
return ret;
--
2.13.4.dirty
* [PATCH v2 3/4] iommu/arm-smmu-v3: Use CMD_SYNC completion MSI
From: Will Deacon @ 2017-10-13 18:32 UTC (permalink / raw)
To: linux-arm-kernel
Hi Robin,
This mostly looks good. Just a few comments below.
On Thu, Aug 31, 2017 at 02:44:27PM +0100, Robin Murphy wrote:
> As an IRQ, the CMD_SYNC interrupt is not particularly useful, not least
> because we often need to wait for sync completion within someone else's
> IRQ handler anyway. However, when the SMMU is both coherent and supports
> MSIs, we can have a lot more fun by not using it as an interrupt at all.
> Following the example suggested in the architecture and using a write
> targeting normal memory, we can let callers wait on a status variable
> outside the lock instead of having to stall the entire queue or even
> touch MMIO registers. Since multiple sync commands are guaranteed to
> complete in order, a simple incrementing sequence count is all we need
> to unambiguously support any realistic number of overlapping waiters.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>
> v2: Remove redundant 'bool msi' command member, other cosmetic tweaks
>
> drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index f066725298cd..311f482b93d5 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -377,7 +377,16 @@
>
> #define CMDQ_SYNC_0_CS_SHIFT 12
> #define CMDQ_SYNC_0_CS_NONE (0UL << CMDQ_SYNC_0_CS_SHIFT)
> +#define CMDQ_SYNC_0_CS_IRQ (1UL << CMDQ_SYNC_0_CS_SHIFT)
> #define CMDQ_SYNC_0_CS_SEV (2UL << CMDQ_SYNC_0_CS_SHIFT)
> +#define CMDQ_SYNC_0_MSH_SHIFT 22
> +#define CMDQ_SYNC_0_MSH_ISH (3UL << CMDQ_SYNC_0_MSH_SHIFT)
> +#define CMDQ_SYNC_0_MSIATTR_SHIFT 24
> +#define CMDQ_SYNC_0_MSIATTR_OIWB (0xfUL << CMDQ_SYNC_0_MSIATTR_SHIFT)
> +#define CMDQ_SYNC_0_MSIDATA_SHIFT 32
> +#define CMDQ_SYNC_0_MSIDATA_MASK 0xffffffffUL
> +#define CMDQ_SYNC_1_MSIADDR_SHIFT 0
> +#define CMDQ_SYNC_1_MSIADDR_MASK 0xffffffffffffcUL
>
> /* Event queue */
> #define EVTQ_ENT_DWORDS 4
> @@ -409,6 +418,7 @@
> /* High-level queue structures */
> #define ARM_SMMU_POLL_TIMEOUT_US 100
> #define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US 1000000 /* 1s! */
> +#define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
We only ever do this when waiting for the queue to drain, so may as well
just reuse the drain timeout.
> #define MSI_IOVA_BASE 0x8000000
> #define MSI_IOVA_LENGTH 0x100000
> @@ -504,6 +514,10 @@ struct arm_smmu_cmdq_ent {
> } pri;
>
> #define CMDQ_OP_CMD_SYNC 0x46
> + struct {
> + u32 msidata;
> + u64 msiaddr;
> + } sync;
> };
> };
>
> @@ -617,6 +631,9 @@ struct arm_smmu_device {
> int gerr_irq;
> int combined_irq;
>
> + atomic_t sync_nr;
> + u32 sync_count;
It's probably worth sticking these in separate cachelines so we don't
get spurious wakeups when sync_nr is incremented. (yes, I know it should
be the ERG, but that can be unreasonably huge!).
> +
> unsigned long ias; /* IPA */
> unsigned long oas; /* PA */
> unsigned long pgsize_bitmap;
> @@ -878,7 +895,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> }
> break;
> case CMDQ_OP_CMD_SYNC:
> - cmd[0] |= CMDQ_SYNC_0_CS_SEV;
> + if (ent->sync.msiaddr)
> + cmd[0] |= CMDQ_SYNC_0_CS_IRQ;
> + else
> + cmd[0] |= CMDQ_SYNC_0_CS_SEV;
> + cmd[0] |= CMDQ_SYNC_0_MSH_ISH | CMDQ_SYNC_0_MSIATTR_OIWB;
> + cmd[0] |= (u64)ent->sync.msidata << CMDQ_SYNC_0_MSIDATA_SHIFT;
> + cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
> break;
> default:
> return -ENOENT;
> @@ -964,21 +987,40 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
> }
>
> +static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
> +{
> + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
> + u32 val = smp_cond_load_acquire(&smmu->sync_count,
> + (int)(VAL - sync_idx) >= 0 ||
> + !ktime_before(ktime_get(), timeout));
> +
> + return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
There are some theoretical overflow issues here which I don't think will
ever occur in practice, but deserve at least a comment to explain why.
> +}
> +
> static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> {
> u64 cmd[CMDQ_ENT_DWORDS];
> unsigned long flags;
> bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
> + bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
> + (smmu->features & ARM_SMMU_FEAT_COHERENCY);
I don't think this is sufficient for the case where we fail to setup MSIs
and fall back on legacy IRQs.
> struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
> int ret;
>
> + if (msi) {
> + ent.sync.msidata = atomic_inc_return(&smmu->sync_nr);
I don't think you need barrier semantics here.
> + ent.sync.msiaddr = virt_to_phys(&smmu->sync_count);
> + }
> arm_smmu_cmdq_build_cmd(cmd, &ent);
>
> spin_lock_irqsave(&smmu->cmdq.lock, flags);
> arm_smmu_cmdq_insert_cmd(smmu, cmd);
> - ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
> + if (!msi)
> + ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>
> + if (msi)
> + ret = arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
This looks like the queue polling should be wrapped up in a function that
returns with the spinlock released.
Will
* [PATCH v2 3/4] iommu/arm-smmu-v3: Use CMD_SYNC completion MSI
From: Robin Murphy @ 2017-10-16 12:25 UTC (permalink / raw)
To: linux-arm-kernel
On 13/10/17 19:32, Will Deacon wrote:
> Hi Robin,
>
> This mostly looks good. Just a few comments below.
>
> On Thu, Aug 31, 2017 at 02:44:27PM +0100, Robin Murphy wrote:
>> As an IRQ, the CMD_SYNC interrupt is not particularly useful, not least
>> because we often need to wait for sync completion within someone else's
>> IRQ handler anyway. However, when the SMMU is both coherent and supports
>> MSIs, we can have a lot more fun by not using it as an interrupt at all.
>> Following the example suggested in the architecture and using a write
>> targeting normal memory, we can let callers wait on a status variable
>> outside the lock instead of having to stall the entire queue or even
>> touch MMIO registers. Since multiple sync commands are guaranteed to
>> complete in order, a simple incrementing sequence count is all we need
>> to unambiguously support any realistic number of overlapping waiters.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>
>> v2: Remove redundant 'bool msi' command member, other cosmetic tweaks
>>
>> drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index f066725298cd..311f482b93d5 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -377,7 +377,16 @@
>>
>> #define CMDQ_SYNC_0_CS_SHIFT 12
>> #define CMDQ_SYNC_0_CS_NONE (0UL << CMDQ_SYNC_0_CS_SHIFT)
>> +#define CMDQ_SYNC_0_CS_IRQ (1UL << CMDQ_SYNC_0_CS_SHIFT)
>> #define CMDQ_SYNC_0_CS_SEV (2UL << CMDQ_SYNC_0_CS_SHIFT)
>> +#define CMDQ_SYNC_0_MSH_SHIFT 22
>> +#define CMDQ_SYNC_0_MSH_ISH (3UL << CMDQ_SYNC_0_MSH_SHIFT)
>> +#define CMDQ_SYNC_0_MSIATTR_SHIFT 24
>> +#define CMDQ_SYNC_0_MSIATTR_OIWB (0xfUL << CMDQ_SYNC_0_MSIATTR_SHIFT)
>> +#define CMDQ_SYNC_0_MSIDATA_SHIFT 32
>> +#define CMDQ_SYNC_0_MSIDATA_MASK 0xffffffffUL
>> +#define CMDQ_SYNC_1_MSIADDR_SHIFT 0
>> +#define CMDQ_SYNC_1_MSIADDR_MASK 0xffffffffffffcUL
>>
>> /* Event queue */
>> #define EVTQ_ENT_DWORDS 4
>> @@ -409,6 +418,7 @@
>> /* High-level queue structures */
>> #define ARM_SMMU_POLL_TIMEOUT_US 100
>> #define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US 1000000 /* 1s! */
>> +#define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
>
> We only ever do this when waiting for the queue to drain, so may as well
> just reuse the drain timeout.
As you've discovered, we remove the "drain" case entirely in the end.
>> #define MSI_IOVA_BASE 0x8000000
>> #define MSI_IOVA_LENGTH 0x100000
>> @@ -504,6 +514,10 @@ struct arm_smmu_cmdq_ent {
>> } pri;
>>
>> #define CMDQ_OP_CMD_SYNC 0x46
>> + struct {
>> + u32 msidata;
>> + u64 msiaddr;
>> + } sync;
>> };
>> };
>>
>> @@ -617,6 +631,9 @@ struct arm_smmu_device {
>> int gerr_irq;
>> int combined_irq;
>>
>> + atomic_t sync_nr;
>> + u32 sync_count;
>
> It's probably worth sticking these in separate cachelines so we don't
> get spurious wakeups when sync_nr is incremented. (yes, I know it should
> be the ERG, but that can be unreasonably huge!).
Good point - we've got 8K of bitmaps embedded in the structure anyway,
so even maximum ERG separation is easily possible.
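For illustration, the separation could be as simple as the following (a
sketch only, not from the posted series - ____cacheline_aligned_in_smp
gives cacheline rather than full ERG separation, so the exact placement
is still to be decided):

	struct arm_smmu_device {
		/* ... existing members ... */

		atomic_t	sync_nr;

		/*
		 * Written by the SMMU's CMD_SYNC completion write and spun
		 * on by waiters, so keep it off the line that incrementing
		 * sync_nr keeps dirtying.
		 */
		u32		sync_count ____cacheline_aligned_in_smp;

		/* ... */
	};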
>> +
>> unsigned long ias; /* IPA */
>> unsigned long oas; /* PA */
>> unsigned long pgsize_bitmap;
>> @@ -878,7 +895,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>> }
>> break;
>> case CMDQ_OP_CMD_SYNC:
>> - cmd[0] |= CMDQ_SYNC_0_CS_SEV;
>> + if (ent->sync.msiaddr)
>> + cmd[0] |= CMDQ_SYNC_0_CS_IRQ;
>> + else
>> + cmd[0] |= CMDQ_SYNC_0_CS_SEV;
>> + cmd[0] |= CMDQ_SYNC_0_MSH_ISH | CMDQ_SYNC_0_MSIATTR_OIWB;
>> + cmd[0] |= (u64)ent->sync.msidata << CMDQ_SYNC_0_MSIDATA_SHIFT;
>> + cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
>> break;
>> default:
>> return -ENOENT;
>> @@ -964,21 +987,40 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>> }
>>
>> +static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
>> +{
>> + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
>> + u32 val = smp_cond_load_acquire(&smmu->sync_count,
>> + (int)(VAL - sync_idx) >= 0 ||
>> + !ktime_before(ktime_get(), timeout));
>> +
>> + return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
>
> There are some theoretical overflow issues here which I don't think will
> ever occur in practice, but deserve at least a comment to explain why.
Even if we did have 2^31 or more CPUs, the size of a queue is bounded at
2^20, so we can never have enough in-flight syncs to get near to an
overflow problem. I can certainly document that if you like.
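Something along these lines, perhaps (a sketch of a possible comment
plus helper - the helper name is made up purely for illustration):

	/*
	 * sync_nr is a free-running u32 and will eventually wrap, but a
	 * waiter can only ever be behind the current count by the number
	 * of in-flight syncs, which is bounded by the command queue size
	 * (at most 2^20 entries). The signed difference therefore stays
	 * well within +/-2^31 and remains correct across the wrap.
	 */
	static bool arm_smmu_sync_complete(u32 val, u32 sync_idx)
	{
		return (int)(val - sync_idx) >= 0;	/* completed */
	}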
>> +}
>> +
>> static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>> {
>> u64 cmd[CMDQ_ENT_DWORDS];
>> unsigned long flags;
>> bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>> + bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
>> + (smmu->features & ARM_SMMU_FEAT_COHERENCY);
>
> I don't think this is sufficient for the case where we fail to setup MSIs
> and fall back on legacy IRQs.
Remember this 'MSI' is going nowhere near the GIC, so the IRQ
configuration is irrelevant (especially after patch #2) - the feature
bits tell us "is the SMMU capable of generating sync-completion writes?"
and "are those writes coherent?", which is precisely what matters here.
>> struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
>> int ret;
>>
>> + if (msi) {
>> + ent.sync.msidata = atomic_inc_return(&smmu->sync_nr);
>
> I don't think you need barrier semantics here.
You mean atomic_inc_return_relaxed() would be sufficient? TBH I don't
think I'd given this any thought - I guess the coherency protocols make
it impossible to do an atomic op on stale data, so that seems reasonable.
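i.e. the hunk would presumably end up looking something like this
(sketch only):

	if (msi) {
		/*
		 * The value only needs to be unique across concurrent
		 * callers; nothing is ordered against the increment
		 * itself, so the relaxed variant should suffice.
		 */
		ent.sync.msidata = atomic_inc_return_relaxed(&smmu->sync_nr);
		ent.sync.msiaddr = virt_to_phys(&smmu->sync_count);
	}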
>> + ent.sync.msiaddr = virt_to_phys(&smmu->sync_count);
>> + }
>> arm_smmu_cmdq_build_cmd(cmd, &ent);
>>
>> spin_lock_irqsave(&smmu->cmdq.lock, flags);
>> arm_smmu_cmdq_insert_cmd(smmu, cmd);
>> - ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
>> + if (!msi)
>> + ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
>> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>>
>> + if (msi)
>> + ret = arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
>
> This looks like the queue polling should be wrapped up in a function that
> returns with the spinlock released.
I did ponder that, but I can't help finding such asymmetric interfaces
pretty grim, and things do get better again once both cases can wait
outside the lock.
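For reference, the sort of asymmetric helper being suggested might look
roughly like this (sketch only, name invented):

	/* Poll for the just-inserted CMD_SYNC and drop cmdq.lock on return */
	static int arm_smmu_cmdq_poll_sync_unlock(struct arm_smmu_device *smmu,
						  bool msi, bool wfe,
						  unsigned long flags)
		__releases(&smmu->cmdq.lock)
	{
		int ret = 0;

		if (!msi)
			ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
		spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

		return ret;
	}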
Robin.
* [PATCH v2 4/4] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
From: Robin Murphy @ 2017-08-31 13:44 UTC (permalink / raw)
To: linux-arm-kernel
Even without the MSI trick, we can still do a lot better than hogging
the entire queue while it drains. All we actually need to do for the
necessary guarantee of completion is wait for our particular command to
have been consumed, and as long as we keep track of where it is there is
no need to block other CPUs from adding further commands in the
meantime. There is one theoretical (but incredibly unlikely) edge case
to avoid, where cons has wrapped twice to still appear 'behind' the sync
position - this is easily disambiguated by adding a generation count to
the queue to indicate when prod wraps, since cons cannot wrap twice
without prod having wrapped at least once.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
v2: New
drivers/iommu/arm-smmu-v3.c | 81 ++++++++++++++++++++++++++++++++-------------
1 file changed, 58 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 311f482b93d5..f5c5da553803 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -417,7 +417,6 @@
/* High-level queue structures */
#define ARM_SMMU_POLL_TIMEOUT_US 100
-#define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US 1000000 /* 1s! */
#define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
#define MSI_IOVA_BASE 0x8000000
@@ -540,6 +539,7 @@ struct arm_smmu_queue {
struct arm_smmu_cmdq {
struct arm_smmu_queue q;
spinlock_t lock;
+ int generation;
};
struct arm_smmu_evtq {
@@ -770,21 +770,12 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
writel(q->prod, q->prod_reg);
}
-/*
- * Wait for the SMMU to consume items. If drain is true, wait until the queue
- * is empty. Otherwise, wait until there is at least one free slot.
- */
-static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
+/* Wait for the SMMU to consume items, until there is at least one free slot */
+static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe)
{
- ktime_t timeout;
- unsigned int delay = 1;
+ ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
- /* Wait longer if it's queue drain */
- timeout = ktime_add_us(ktime_get(), drain ?
- ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US :
- ARM_SMMU_POLL_TIMEOUT_US);
-
- while (queue_sync_cons(q), (drain ? !queue_empty(q) : queue_full(q))) {
+ while (queue_sync_cons(q), queue_full(q)) {
if (ktime_compare(ktime_get(), timeout) > 0)
return -ETIMEDOUT;
@@ -792,8 +783,7 @@ static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
wfe();
} else {
cpu_relax();
- udelay(delay);
- delay *= 2;
+ udelay(1);
}
}
@@ -959,15 +949,20 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
}
-static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
+static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
{
struct arm_smmu_queue *q = &smmu->cmdq.q;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
while (queue_insert_raw(q, cmd) == -ENOSPC) {
- if (queue_poll_cons(q, false, wfe))
+ if (queue_poll_cons(q, wfe))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
}
+
+ if (Q_IDX(q, q->prod) == 0)
+ smmu->cmdq.generation++;
+
+ return q->prod;
}
static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
@@ -997,15 +992,54 @@ static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
}
+static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
+ int sync_gen)
+{
+ ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
+ struct arm_smmu_queue *q = &smmu->cmdq.q;
+ bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ unsigned int delay = 1;
+
+ do {
+ queue_sync_cons(q);
+ /*
+ * If we see updates quickly enough, cons has passed sync_idx,
+ * but not yet wrapped. At worst, cons might have actually
+ * wrapped an even number of times, but that still guarantees
+ * the original sync must have been consumed.
+ */
+ if (Q_IDX(q, q->cons) >= Q_IDX(q, sync_idx) &&
+ Q_WRP(q, sync_idx) == Q_WRP(q, q->cons))
+ return 0;
+ /*
+ * Otherwise, cons may have passed sync_idx and wrapped one or
+ * more times to appear behind it again, but in that case prod
+ * must also be one or more generations ahead.
+ */
+ if (Q_IDX(q, q->cons) < Q_IDX(q, sync_idx) &&
+ READ_ONCE(smmu->cmdq.generation) != sync_gen)
+ return 0;
+
+ if (wfe) {
+ wfe();
+ } else {
+ cpu_relax();
+ udelay(delay);
+ delay *= 2;
+ }
+ } while (ktime_before(ktime_get(), timeout));
+
+ return -ETIMEDOUT;
+}
+
static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
{
u64 cmd[CMDQ_ENT_DWORDS];
unsigned long flags;
- bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
(smmu->features & ARM_SMMU_FEAT_COHERENCY);
struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
- int ret;
+ int ret, sync_idx, sync_gen;
if (msi) {
ent.sync.msidata = atomic_inc_return(&smmu->sync_nr);
@@ -1014,13 +1048,14 @@ static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
arm_smmu_cmdq_build_cmd(cmd, &ent);
spin_lock_irqsave(&smmu->cmdq.lock, flags);
- arm_smmu_cmdq_insert_cmd(smmu, cmd);
- if (!msi)
- ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
+ sync_idx = arm_smmu_cmdq_insert_cmd(smmu, cmd);
+ sync_gen = READ_ONCE(smmu->cmdq.generation);
spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
if (msi)
ret = arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
+ else
+ ret = arm_smmu_sync_poll_cons(smmu, sync_idx, sync_gen);
if (ret)
dev_err_ratelimited(smmu->dev, "CMD_SYNC timeout\n");
}
--
2.13.4.dirty
* [PATCH v2 4/4] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
From: Will Deacon @ 2017-10-13 18:59 UTC (permalink / raw)
To: linux-arm-kernel
Hi Robin,
Some of my comments on patch 3 are addressed here, but I'm really struggling
to convince myself that this algorithm is correct. My preference would
be to leave the code as it is for SMMUs that don't implement MSIs, but
comments below anyway because it's an interesting idea.
On Thu, Aug 31, 2017 at 02:44:28PM +0100, Robin Murphy wrote:
> Even without the MSI trick, we can still do a lot better than hogging
> the entire queue while it drains. All we actually need to do for the
> necessary guarantee of completion is wait for our particular command to
> have been consumed, and as long as we keep track of where it is there is
> no need to block other CPUs from adding further commands in the
> meantime. There is one theoretical (but incredibly unlikely) edge case
> to avoid, where cons has wrapped twice to still appear 'behind' the sync
> position - this is easily disambiguated by adding a generation count to
> the queue to indicate when prod wraps, since cons cannot wrap twice
> without prod having wrapped at least once.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>
> v2: New
>
> drivers/iommu/arm-smmu-v3.c | 81 ++++++++++++++++++++++++++++++++-------------
> 1 file changed, 58 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 311f482b93d5..f5c5da553803 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -417,7 +417,6 @@
>
> /* High-level queue structures */
> #define ARM_SMMU_POLL_TIMEOUT_US 100
> -#define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US 1000000 /* 1s! */
> #define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
>
> #define MSI_IOVA_BASE 0x8000000
> @@ -540,6 +539,7 @@ struct arm_smmu_queue {
> struct arm_smmu_cmdq {
> struct arm_smmu_queue q;
> spinlock_t lock;
> + int generation;
> };
>
> struct arm_smmu_evtq {
> @@ -770,21 +770,12 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
> writel(q->prod, q->prod_reg);
> }
>
> -/*
> - * Wait for the SMMU to consume items. If drain is true, wait until the queue
> - * is empty. Otherwise, wait until there is at least one free slot.
> - */
> -static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
> +/* Wait for the SMMU to consume items, until there is at least one free slot */
> +static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe)
> {
> - ktime_t timeout;
> - unsigned int delay = 1;
> + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
>
> - /* Wait longer if it's queue drain */
> - timeout = ktime_add_us(ktime_get(), drain ?
> - ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US :
> - ARM_SMMU_POLL_TIMEOUT_US);
> -
> - while (queue_sync_cons(q), (drain ? !queue_empty(q) : queue_full(q))) {
> + while (queue_sync_cons(q), queue_full(q)) {
> if (ktime_compare(ktime_get(), timeout) > 0)
> return -ETIMEDOUT;
>
> @@ -792,8 +783,7 @@ static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
> wfe();
> } else {
> cpu_relax();
> - udelay(delay);
> - delay *= 2;
> + udelay(1);
> }
> }
>
> @@ -959,15 +949,20 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
> queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
> }
>
> -static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
> +static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
> {
> struct arm_smmu_queue *q = &smmu->cmdq.q;
> bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>
> while (queue_insert_raw(q, cmd) == -ENOSPC) {
> - if (queue_poll_cons(q, false, wfe))
> + if (queue_poll_cons(q, wfe))
> dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> }
> +
> + if (Q_IDX(q, q->prod) == 0)
> + smmu->cmdq.generation++;
The readers of generation are using READ_ONCE, so you're missing something
here.
> +
> + return q->prod;
> }
>
> static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> @@ -997,15 +992,54 @@ static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
> return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
> }
>
> +static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
> + int sync_gen)
> +{
> + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
> + struct arm_smmu_queue *q = &smmu->cmdq.q;
> + bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
> + unsigned int delay = 1;
> +
> + do {
> + queue_sync_cons(q);
> + /*
> + * If we see updates quickly enough, cons has passed sync_idx,
> + * but not yet wrapped. At worst, cons might have actually
> + * wrapped an even number of times, but that still guarantees
> + * the original sync must have been consumed.
> + */
> + if (Q_IDX(q, q->cons) >= Q_IDX(q, sync_idx) &&
> + Q_WRP(q, sync_idx) == Q_WRP(q, q->cons))
Can you move this into a separate function please, like we have for
queue_full and queue_empty?
> + return 0;
> + /*
> + * Otherwise, cons may have passed sync_idx and wrapped one or
> + * more times to appear behind it again, but in that case prod
> + * must also be one or more generations ahead.
> + */
> + if (Q_IDX(q, q->cons) < Q_IDX(q, sync_idx) &&
> + READ_ONCE(smmu->cmdq.generation) != sync_gen)
There's another daft overflow case here which deserves a comment (and even
if it *did* happen, we'll recover gracefully).
> + return 0;
> +
> + if (wfe) {
> + wfe();
This is a bit scary... if we miss a generation update, just due to store
propagation latency (maybe it's buffered by the updater), I think we can
end up going into wfe when there's not an event pending. Using xchg
everywhere might work, but there's still a race on having somebody update
generation. The ordering here just looks generally suspicious to me because
you have the generation writer in arm_smmu_cmdq_insert_cmd effectively
doing:
Write prod
Write generation
and the reader in arm_smmu_sync_poll_cons doing:
Read cons
Read generation
so I can't see anything that gives you order to guarantee that the
generation is seen to be up-to-date.
But it's also Friday evening and my brain is pretty fried ;)
Will
* [PATCH v2 4/4] iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
From: Robin Murphy @ 2017-10-16 13:12 UTC (permalink / raw)
To: linux-arm-kernel
On 13/10/17 19:59, Will Deacon wrote:
> Hi Robin,
>
> Some of my comments on patch 3 are addressed here, but I'm really struggling
> to convince myself that this algorithm is correct. My preference would
> be to leave the code as it is for SMMUs that don't implement MSIs, but
> comments below anyway because it's an interesting idea.
From scrounging up boot logs I can see that neither the Cavium nor
HiSilicon SMMUv3 implementations have MSI support (the one in D05
doesn't even have WFE), so there is a real motivation to improve
scalability on current systems - it's not *just* a cool feature!
> On Thu, Aug 31, 2017 at 02:44:28PM +0100, Robin Murphy wrote:
>> Even without the MSI trick, we can still do a lot better than hogging
>> the entire queue while it drains. All we actually need to do for the
>> necessary guarantee of completion is wait for our particular command to
>> have been consumed, and as long as we keep track of where it is there is
>> no need to block other CPUs from adding further commands in the
>> meantime. There is one theoretical (but incredibly unlikely) edge case
>> to avoid, where cons has wrapped twice to still appear 'behind' the sync
>> position - this is easily disambiguated by adding a generation count to
>> the queue to indicate when prod wraps, since cons cannot wrap twice
>> without prod having wrapped at least once.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>
>> v2: New
>>
>> drivers/iommu/arm-smmu-v3.c | 81 ++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 58 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 311f482b93d5..f5c5da553803 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -417,7 +417,6 @@
>>
>> /* High-level queue structures */
>> #define ARM_SMMU_POLL_TIMEOUT_US 100
>> -#define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US 1000000 /* 1s! */
>> #define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
>>
>> #define MSI_IOVA_BASE 0x8000000
>> @@ -540,6 +539,7 @@ struct arm_smmu_queue {
>> struct arm_smmu_cmdq {
>> struct arm_smmu_queue q;
>> spinlock_t lock;
>> + int generation;
>> };
>>
>> struct arm_smmu_evtq {
>> @@ -770,21 +770,12 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
>> writel(q->prod, q->prod_reg);
>> }
>>
>> -/*
>> - * Wait for the SMMU to consume items. If drain is true, wait until the queue
>> - * is empty. Otherwise, wait until there is at least one free slot.
>> - */
>> -static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
>> +/* Wait for the SMMU to consume items, until there is at least one free slot */
>> +static int queue_poll_cons(struct arm_smmu_queue *q, bool wfe)
>> {
>> - ktime_t timeout;
>> - unsigned int delay = 1;
>> + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
>>
>> - /* Wait longer if it's queue drain */
>> - timeout = ktime_add_us(ktime_get(), drain ?
>> - ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US :
>> - ARM_SMMU_POLL_TIMEOUT_US);
>> -
>> - while (queue_sync_cons(q), (drain ? !queue_empty(q) : queue_full(q))) {
>> + while (queue_sync_cons(q), queue_full(q)) {
>> if (ktime_compare(ktime_get(), timeout) > 0)
>> return -ETIMEDOUT;
>>
>> @@ -792,8 +783,7 @@ static int queue_poll_cons(struct arm_smmu_queue *q, bool drain, bool wfe)
>> wfe();
>> } else {
>> cpu_relax();
>> - udelay(delay);
>> - delay *= 2;
>> + udelay(1);
>> }
>> }
>>
>> @@ -959,15 +949,20 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
>> queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
>> }
>>
>> -static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
>> +static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
>> {
>> struct arm_smmu_queue *q = &smmu->cmdq.q;
>> bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>>
>> while (queue_insert_raw(q, cmd) == -ENOSPC) {
>> - if (queue_poll_cons(q, false, wfe))
>> + if (queue_poll_cons(q, wfe))
>> dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>> }
>> +
>> + if (Q_IDX(q, q->prod) == 0)
>> + smmu->cmdq.generation++;
>
> The readers of generation are using READ_ONCE, so you're missing something
> here.
Yeah, I was a bit back-and-forth on this. The readers want a READ_ONCE
if only to prevent it being hoisted out of the polling loop, but as long
as the update of generation is single-copy-atomic, the exact point at
which it occurs shouldn't matter so much, since it's only written under
the cmdq lock. I guess it depends how much we trust GCC's claim of the
atomicity of int - I have no great objection to
WRITE_ONCE(smmu->cmdq.generation, smmu->cmdq.generation + 1);
other than it being really long.
>> +
>> + return q->prod;
>> }
>>
>> static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>> @@ -997,15 +992,54 @@ static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
>> return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
>> }
>>
>> +static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
>> + int sync_gen)
>> +{
>> + ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
>> + struct arm_smmu_queue *q = &smmu->cmdq.q;
>> + bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>> + unsigned int delay = 1;
>> +
>> + do {
>> + queue_sync_cons(q);
>> + /*
>> + * If we see updates quickly enough, cons has passed sync_idx,
>> + * but not yet wrapped. At worst, cons might have actually
>> + * wrapped an even number of times, but that still guarantees
>> + * the original sync must have been consumed.
>> + */
>> + if (Q_IDX(q, q->cons) >= Q_IDX(q, sync_idx) &&
>> + Q_WRP(q, sync_idx) == Q_WRP(q, q->cons))
>
> Can you move this into a separate function please, like we have for
> queue_full and queue_empty?
OK, but given that it's only half of the "has cons moved past this
index" operation, I'm not really sure what it could be called -
queue_ahead_not_wrapped() comes to mind, but still seems a bit cryptic.
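i.e. something like (sketch only, using that tentative name):

	/*
	 * True if cons has reached idx within the current wrap - this
	 * deliberately says nothing about the multiple-wrap case, which
	 * the generation check still has to cover.
	 */
	static bool queue_ahead_not_wrapped(struct arm_smmu_queue *q, u32 idx)
	{
		return Q_IDX(q, q->cons) >= Q_IDX(q, idx) &&
		       Q_WRP(q, q->cons) == Q_WRP(q, idx);
	}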
>> + return 0;
>> + /*
>> + * Otherwise, cons may have passed sync_idx and wrapped one or
>> + * more times to appear behind it again, but in that case prod
>> + * must also be one or more generations ahead.
>> + */
>> + if (Q_IDX(q, q->cons) < Q_IDX(q, sync_idx) &&
>> + READ_ONCE(smmu->cmdq.generation) != sync_gen)
>
> There's another daft overflow case here which deserves a comment (and even
> if it *did* happen, we'll recover gracefully).
You mean exactly 2^32 queue generations passing between polls? Yeah, not
happening :P
>> + return 0;
>> +
>> + if (wfe) {
>> + wfe();
>
> This is a bit scary... if we miss a generation update, just due to store
> propagation latency (maybe it's buffered by the updater), I think we can
> end up going into wfe when there's not an event pending. Using xchg
> everywhere might work, but there's still a race on having somebody update
> generation. The ordering here just looks generally suspicious to me because
> you have the generation writer in arm_smmu_cmdq_insert_cmd effectively
> doing:
>
> Write prod
> Write generation
>
> and the reader in arm_smmu_sync_poll_cons doing:
>
> Read cons
> Read generation
>
> so I can't see anything that gives you order to guarantee that the
> generation is seen to be up-to-date.
On reflection I'm pretty sure the below should suffice, to piggy-back
off the DSB implicit in queue_insert_raw(). The readers only care about
the generation *after* cons has been observed to go backwards, so the
exact order of the generation and prod updates makes no practical
difference other than closing that race window.
Robin
----->8-----
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 12cdc5e50675..78ba8269c44c 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -959,14 +959,14 @@ static u32 arm_smmu_cmdq_insert_cmd(struct arm_smmu_device *smmu, u64 *cmd)
struct arm_smmu_queue *q = &smmu->cmdq.q;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ if (Q_IDX(q, q->prod + 1) == 0)
+ smmu->cmdq.generation++;
+
while (queue_insert_raw(q, cmd) == -ENOSPC) {
if (queue_poll_cons(q, wfe))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
}
- if (Q_IDX(q, q->prod) == 0)
- smmu->cmdq.generation++;
-
return q->prod;
}
* [RFT] iommu/arm-smmu-v3: Use burst-polling for sync completion
From: Robin Murphy @ 2017-08-31 13:44 UTC (permalink / raw)
To: linux-arm-kernel
While CMD_SYNC is unlikely to complete immediately such that we never go
round the polling loop, with a lightly-loaded queue it may still do so
long before the delay period is up. If we have no better completion
notifier, use similar logic as we have for SMMUv2 to spin a number of
times before each backoff, so that we have more chance of catching syncs
which complete relatively quickly and avoid delaying unnecessarily.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
This is mostly here for theoretical completeness - unless it proves to
actually give a measurable benefit (I have no idea), I'd be inclined
not to consider it for merging.
drivers/iommu/arm-smmu-v3.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f5c5da553803..b92cd65f43f8 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -418,6 +418,7 @@
/* High-level queue structures */
#define ARM_SMMU_POLL_TIMEOUT_US 100
#define ARM_SMMU_SYNC_TIMEOUT_US 1000000 /* 1s! */
+#define ARM_SMMU_SYNC_SPIN_COUNT 10
#define MSI_IOVA_BASE 0x8000000
#define MSI_IOVA_LENGTH 0x100000
@@ -998,7 +999,7 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
struct arm_smmu_queue *q = &smmu->cmdq.q;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
- unsigned int delay = 1;
+ unsigned int delay = 1, spin_cnt = 0;
do {
queue_sync_cons(q);
@@ -1022,10 +1023,13 @@ static int arm_smmu_sync_poll_cons(struct arm_smmu_device *smmu, u32 sync_idx,
if (wfe) {
wfe();
- } else {
+ } else if (++spin_cnt < ARM_SMMU_SYNC_SPIN_COUNT) {
cpu_relax();
+ continue;
+ } else {
udelay(delay);
delay *= 2;
+ spin_cnt = 0;
}
} while (ktime_before(ktime_get(), timeout));
--
2.13.4.dirty
* [PATCH v2 0/4] SMMUv3 CMD_SYNC optimisation
From: Will Deacon @ 2017-10-13 19:05 UTC (permalink / raw)
To: linux-arm-kernel
Hi Robin,
On Thu, Aug 31, 2017 at 02:44:24PM +0100, Robin Murphy wrote:
> Since Nate reported a reasonable performance boost from the out-of-line
> MSI polling in v1 [1], I've now implemented the equivalent for cons
> polling as well - that has been boot-tested on D05 with some trivial I/O
> and at least doesn't seem to lock up or explode. There's also a little
> cosmetic tweaking to make the patches a bit cleaner as a series.
>
> Robin.
>
[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19657.html
>
> Robin Murphy (5):
> iommu/arm-smmu-v3: Specialise CMD_SYNC handling
> iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
> iommu/arm-smmu-v3: Use CMD_SYNC completion interrupt
> iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
> iommu/arm-smmu-v3: Use burst-polling for sync completion
What's this final mythical patch about? I don't see it in the series.
Anyway, the first two patches look fine to me, but this doesn't apply
on top of my iommu/devel branch so they will need rebasing.
Cheers,
Will
* [PATCH v2 0/4] SMMUv3 CMD_SYNC optimisation
From: Robin Murphy @ 2017-10-16 13:18 UTC (permalink / raw)
To: linux-arm-kernel
On 13/10/17 20:05, Will Deacon wrote:
> Hi Robin,
>
> On Thu, Aug 31, 2017 at 02:44:24PM +0100, Robin Murphy wrote:
>> Since Nate reported a reasonable performance boost from the out-of-line
>> MSI polling in v1 [1], I've now implemented the equivalent for cons
>> polling as well - that has been boot-tested on D05 with some trivial I/O
>> and at least doesn't seem to lock up or explode. There's also a little
>> cosmetic tweaking to make the patches a bit cleaner as a series.
>>
>> Robin.
>>
> [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19657.html
>>
>> Robin Murphy (5):
>> iommu/arm-smmu-v3: Specialise CMD_SYNC handling
>> iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
>> iommu/arm-smmu-v3: Use CMD_SYNC completion interrupt
>> iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
>> iommu/arm-smmu-v3: Use burst-polling for sync completion
>
> What's this final mythical patch about? I don't see it in the series.
It's in the place 5/5 would be, I just tagged it as [RFT] since I have
zero evidence whether it's worth the bother or not.
> Anyway, the first two patches look fine to me, but this doesn't apply
> on top of my iommu/devel branch so they will need rebasing.
Will do.
Thanks for the review,
Robin.
* [PATCH v2 0/4] SMMUv3 CMD_SYNC optimisation
From: Will Deacon @ 2017-10-16 15:02 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Oct 13, 2017 at 08:05:21PM +0100, Will Deacon wrote:
> On Thu, Aug 31, 2017 at 02:44:24PM +0100, Robin Murphy wrote:
> > Since Nate reported a reasonable performance boost from the out-of-line
> > MSI polling in v1 [1], I've now implemented the equivalent for cons
> > polling as well - that has been boot-tested on D05 with some trivial I/O
> > and at least doesn't seem to lock up or explode. There's also a little
> > cosmetic tweaking to make the patches a bit cleaner as a series.
> >
> > Robin.
> >
> > [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19657.html
> >
> > Robin Murphy (5):
> > iommu/arm-smmu-v3: Specialise CMD_SYNC handling
> > iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
> > iommu/arm-smmu-v3: Use CMD_SYNC completion interrupt
> > iommu/arm-smmu-v3: Poll for CMD_SYNC outside cmdq lock
> > iommu/arm-smmu-v3: Use burst-polling for sync completion
>
> What's this final mythical patch about? I don't see it in the series.
>
> Anyway, the first two patches look fine to me, but this doesn't apply
> on top of my iommu/devel branch so they will need rebasing.
No idea what I was doing wrong on Friday, but the first two patches apply
cleanly now, so I'll push them out in a bit.
Sorry for the noise,
Will