* [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE
@ 2024-08-29 18:07 Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 1/5] iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported Suravee Suthikulpanit
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Suravee Suthikulpanit @ 2024-08-29 18:07 UTC (permalink / raw)
To: linux-kernel, iommu
Cc: joro, robin.murphy, vasant.hegde, ubizjak, jgg, jon.grimm,
santosh.shukla, pandoh, kumaranand, Suravee Suthikulpanit
The v2 series splits the v1 patch to address several concerns in review feedback
(https://lore.kernel.org/lkml/e937e26f-038a-6d01-76a9-76c86760ca4a@gmail.com/T/).
Changelog:
* Patch 1, 2, 5 are new.
* Patch 3:
- Change struct dev_table_entry to union,
- Fix update_dte256() per feedback from v1
- Add get_dte256() helper function
* Patch 4: Refactoring set_dte_entry
Thanks,
Suravee
Suravee Suthikulpanit (5):
iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported
iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE)
iommu/amd: Introduce helper functions to access and update 256-bit DTE
iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers
iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry()
drivers/iommu/amd/amd_iommu_types.h | 6 +-
drivers/iommu/amd/init.c | 23 ++-
drivers/iommu/amd/iommu.c | 237 ++++++++++++++++++----------
3 files changed, 171 insertions(+), 95 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/5] iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported
2024-08-29 18:07 [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE Suravee Suthikulpanit
@ 2024-08-29 18:07 ` Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE) Suravee Suthikulpanit
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Suravee Suthikulpanit @ 2024-08-29 18:07 UTC (permalink / raw)
To: linux-kernel, iommu
Cc: joro, robin.murphy, vasant.hegde, ubizjak, jgg, jon.grimm,
santosh.shukla, pandoh, kumaranand, Suravee Suthikulpanit
According to the AMD IOMMU spec, the IOMMU reads the entire DTE either
in two 128-bit transactions or a single 256-bit transaction. It is
recommended to update DTE using 128-bit operation followed by an
INVALIDATE_DEVTAB_ENTYRY command when the IV=1b or V=1b.
According to the AMD BIOS and Kernel Developer's Guide (BDKG) dated back
to family 10h Processor [1], which is the first introduction of AMD IOMMU,
AMD processor always has CPUID Fn0000_0001_ECX[CMPXCHG16B]=1.
Therefore, it is safe to assume cmpxchg128 is available with all AMD
processor w/ IOMMU.
In addition, the CMPXCHG16B feature has already been checked separately
before enabling the GA, XT, and GAM modes. Consolidate the detection logic,
and fail the IOMMU initialization if the feature is not supported.
[1] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/31116.pdf
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
drivers/iommu/amd/init.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 6b15ce09e78d..983c09898a10 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1762,13 +1762,8 @@ static int __init init_iommu_one(struct amd_iommu *iommu, struct ivhd_header *h,
else
iommu->mmio_phys_end = MMIO_CNTR_CONF_OFFSET;
- /*
- * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
- * GAM also requires GA mode. Therefore, we need to
- * check cmpxchg16b support before enabling it.
- */
- if (!boot_cpu_has(X86_FEATURE_CX16) ||
- ((h->efr_attr & (0x1 << IOMMU_FEAT_GASUP_SHIFT)) == 0))
+ /* GAM requires GA mode. */
+ if ((h->efr_attr & (0x1 << IOMMU_FEAT_GASUP_SHIFT)) == 0)
amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY;
break;
case 0x11:
@@ -1778,13 +1773,8 @@ static int __init init_iommu_one(struct amd_iommu *iommu, struct ivhd_header *h,
else
iommu->mmio_phys_end = MMIO_CNTR_CONF_OFFSET;
- /*
- * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
- * XT, GAM also requires GA mode. Therefore, we need to
- * check cmpxchg16b support before enabling them.
- */
- if (!boot_cpu_has(X86_FEATURE_CX16) ||
- ((h->efr_reg & (0x1 << IOMMU_EFR_GASUP_SHIFT)) == 0)) {
+ /* XT and GAM require GA mode. */
+ if ((h->efr_reg & (0x1 << IOMMU_EFR_GASUP_SHIFT)) == 0) {
amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY;
break;
}
@@ -3049,6 +3039,11 @@ static int __init early_amd_iommu_init(void)
return -EINVAL;
}
+ if (!boot_cpu_has(X86_FEATURE_CX16)) {
+ pr_err("Failed to initialize. The CMPXCHG16B feature is required.\n");
+ return -EINVAL;
+ }
+
/*
* Validate checksum here so we don't need to do it when
* we actually parse the table
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE)
2024-08-29 18:07 [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 1/5] iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported Suravee Suthikulpanit
@ 2024-08-29 18:07 ` Suravee Suthikulpanit
2024-08-29 19:34 ` Jason Gunthorpe
2024-08-29 18:07 ` [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE Suravee Suthikulpanit
` (2 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Suravee Suthikulpanit @ 2024-08-29 18:07 UTC (permalink / raw)
To: linux-kernel, iommu
Cc: joro, robin.murphy, vasant.hegde, ubizjak, jgg, jon.grimm,
santosh.shukla, pandoh, kumaranand, Suravee Suthikulpanit
In preparation for 256-bit DTE update, each DTE access/update needs
to be protected using synchronication mechanism to prevent conflict.
Introduce a new rw-semaphore struct dev_data.dte_sem, which is per
device. Also update certain helper functions to use the new dte_sem.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
drivers/iommu/amd/amd_iommu_types.h | 1 +
drivers/iommu/amd/iommu.c | 10 ++++++++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index c9f9a598eb82..65f3a073999d 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -833,6 +833,7 @@ struct devid_map {
struct iommu_dev_data {
/*Protect against attach/detach races */
spinlock_t lock;
+ struct rw_semaphore dte_sem;
struct list_head list; /* For domain->dev_list */
struct llist_node dev_data_list; /* For global dev_data_list */
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 87c5385ce3f2..994ed02842b9 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -205,6 +205,7 @@ static struct iommu_dev_data *alloc_dev_data(struct amd_iommu *iommu, u16 devid)
return NULL;
spin_lock_init(&dev_data->lock);
+ init_rwsem(&dev_data->dte_sem);
dev_data->devid = devid;
ratelimit_default_init(&dev_data->rs);
@@ -1946,10 +1947,13 @@ static void set_dte_entry(struct amd_iommu *iommu,
}
}
-static void clear_dte_entry(struct amd_iommu *iommu, u16 devid)
+static void clear_dte_entry(struct amd_iommu *iommu, struct iommu_dev_data *dev_data)
{
+ u16 devid = dev_data->devid;
struct dev_table_entry *dev_table = get_dev_table(iommu);
+ down_write(&dev_data->dte_sem);
+
/* remove entry from the device table seen by the hardware */
dev_table[devid].data[0] = DTE_FLAG_V;
@@ -1959,6 +1963,8 @@ static void clear_dte_entry(struct amd_iommu *iommu, u16 devid)
dev_table[devid].data[1] &= DTE_FLAG_MASK;
amd_iommu_apply_erratum_63(iommu, devid);
+
+ up_write(&dev_data->dte_sem);
}
/* Update and flush DTE for the given device */
@@ -1969,7 +1975,7 @@ void amd_iommu_dev_update_dte(struct iommu_dev_data *dev_data, bool set)
if (set)
set_dte_entry(iommu, dev_data);
else
- clear_dte_entry(iommu, dev_data->devid);
+ clear_dte_entry(iommu, dev_data);
clone_aliases(iommu, dev_data->dev);
device_flush_dte(dev_data);
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE
2024-08-29 18:07 [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 1/5] iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE) Suravee Suthikulpanit
@ 2024-08-29 18:07 ` Suravee Suthikulpanit
2024-08-29 19:28 ` Jason Gunthorpe
2024-08-29 18:07 ` [PATCH v2 4/5] iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry() Suravee Suthikulpanit
4 siblings, 1 reply; 16+ messages in thread
From: Suravee Suthikulpanit @ 2024-08-29 18:07 UTC (permalink / raw)
To: linux-kernel, iommu
Cc: joro, robin.murphy, vasant.hegde, ubizjak, jgg, jon.grimm,
santosh.shukla, pandoh, kumaranand, Suravee Suthikulpanit
The current implementation does not follow 128-bit write requirement
to update DTE as specified in the AMD I/O Virtualization Techonology
(IOMMU) Specification.
Therefore, modify the struct dev_table_entry to contain union of u128 data
array, and introduce two helper functions:
* update_dte256() to update DTE using two 128-bit cmpxchg
operations to update 256-bit DTE with the modified structure.
Also use the struct iommu_dev_data.dte_sem to synchronize 256-bit
data update.
* get_dte256() to copy 256-bit DTE to the provided structrue.
Also use the struct iommu_dev_data.dte_sem to synchronize 256-bit
data access.
Also, update existing code to use the new helper functions in this and
subsequent patches.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
drivers/iommu/amd/amd_iommu_types.h | 5 +-
drivers/iommu/amd/iommu.c | 81 +++++++++++++++++++++++------
2 files changed, 70 insertions(+), 16 deletions(-)
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 65f3a073999d..2787d6af5a59 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -884,7 +884,10 @@ extern struct amd_iommu *amd_iommus[MAX_IOMMUS];
* Structure defining one entry in the device table
*/
struct dev_table_entry {
- u64 data[4];
+ union {
+ u64 data[4];
+ u128 data128[2];
+ };
};
/*
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 994ed02842b9..93bca5c68bca 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -85,6 +85,47 @@ static void set_dte_entry(struct amd_iommu *iommu,
*
****************************************************************************/
+static void update_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
+ struct dev_table_entry *new)
+{
+ struct dev_table_entry *dev_table = get_dev_table(iommu);
+ struct dev_table_entry *ptr = &dev_table[dev_data->devid];
+ struct dev_table_entry old;
+ u128 tmp;
+
+ down_write(&dev_data->dte_sem);
+
+ old.data128[0] = ptr->data128[0];
+ old.data128[1] = ptr->data128[1];
+
+ tmp = cmpxchg128(&ptr->data128[1], old.data128[1], new->data128[1]);
+ if (tmp == old.data128[1]) {
+ if (!try_cmpxchg128(&ptr->data128[0], &old.data128[0], new->data128[0])) {
+ /* Restore hi 128-bit */
+ cmpxchg128(&ptr->data128[1], new->data128[1], tmp);
+ pr_err("%s: Failed. devid=%#x, dte=%016llx:%016llx:%016llx:%016llx\n",
+ __func__, dev_data->devid, new->data[0], new->data[1],
+ new->data[2], new->data[3]);
+ }
+ }
+
+ up_write(&dev_data->dte_sem);
+}
+
+static void get_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
+ struct dev_table_entry *dte)
+{
+ struct dev_table_entry *ptr;
+ struct dev_table_entry *dev_table = get_dev_table(iommu);
+
+ ptr = &dev_table[dev_data->devid];
+
+ down_read(&dev_data->dte_sem);
+ dte->data128[0] = ptr->data128[0];
+ dte->data128[1] = ptr->data128[1];
+ up_read(&dev_data->dte_sem);
+}
+
static inline bool pdom_is_v2_pgtbl_mode(struct protection_domain *pdom)
{
return (pdom && (pdom->pd_mode == PD_MODE_V2));
@@ -233,8 +274,9 @@ static struct iommu_dev_data *search_dev_data(struct amd_iommu *iommu, u16 devid
static int clone_alias(struct pci_dev *pdev, u16 alias, void *data)
{
+ struct dev_table_entry dte;
struct amd_iommu *iommu;
- struct dev_table_entry *dev_table;
+ struct iommu_dev_data *dev_data;
u16 devid = pci_dev_id(pdev);
if (devid == alias)
@@ -244,11 +286,19 @@ static int clone_alias(struct pci_dev *pdev, u16 alias, void *data)
if (!iommu)
return 0;
+ dev_data = dev_iommu_priv_get(&pdev->dev);
+ if (!dev_data)
+ return -EINVAL;
+
+ get_dte256(iommu, dev_data, &dte);
+
+ /* Setup for alias */
+ dev_data = search_dev_data(iommu, alias);
+ if (!dev_data)
+ return -EINVAL;
+
+ update_dte256(iommu, dev_data, &dte);
amd_iommu_set_rlookup_table(iommu, alias);
- dev_table = get_dev_table(iommu);
- memcpy(dev_table[alias].data,
- dev_table[devid].data,
- sizeof(dev_table[alias].data));
return 0;
}
@@ -584,10 +634,13 @@ static void amd_iommu_uninit_device(struct device *dev)
static void dump_dte_entry(struct amd_iommu *iommu, u16 devid)
{
int i;
- struct dev_table_entry *dev_table = get_dev_table(iommu);
+ struct dev_table_entry dte;
+ struct iommu_dev_data *dev_data = find_dev_data(iommu, devid);
+
+ get_dte256(iommu, dev_data, &dte);
for (i = 0; i < 4; ++i)
- pr_err("DTE[%d]: %016llx\n", i, dev_table[devid].data[i]);
+ pr_err("DTE[%d]: %016llx\n", i, dte.data[i]);
}
static void dump_command(unsigned long phys_addr)
@@ -2667,12 +2720,10 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
bool enable)
{
struct protection_domain *pdomain = to_pdomain(domain);
- struct dev_table_entry *dev_table;
struct iommu_dev_data *dev_data;
bool domain_flush = false;
struct amd_iommu *iommu;
unsigned long flags;
- u64 pte_root;
spin_lock_irqsave(&pdomain->lock, flags);
if (!(pdomain->dirty_tracking ^ enable)) {
@@ -2681,16 +2732,16 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
}
list_for_each_entry(dev_data, &pdomain->dev_list, list) {
- iommu = get_amd_iommu_from_dev_data(dev_data);
+ struct dev_table_entry dte;
- dev_table = get_dev_table(iommu);
- pte_root = dev_table[dev_data->devid].data[0];
+ iommu = get_amd_iommu_from_dev_data(dev_data);
+ get_dte256(iommu, dev_data, &dte);
- pte_root = (enable ? pte_root | DTE_FLAG_HAD :
- pte_root & ~DTE_FLAG_HAD);
+ dte.data[0] = (enable ? dte.data[0] | DTE_FLAG_HAD :
+ dte.data[0] & ~DTE_FLAG_HAD);
/* Flush device DTE */
- dev_table[dev_data->devid].data[0] = pte_root;
+ update_dte256(iommu, dev_data, &dte);
device_flush_dte(dev_data);
domain_flush = true;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 4/5] iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers
2024-08-29 18:07 [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE Suravee Suthikulpanit
` (2 preceding siblings ...)
2024-08-29 18:07 ` [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE Suravee Suthikulpanit
@ 2024-08-29 18:07 ` Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry() Suravee Suthikulpanit
4 siblings, 0 replies; 16+ messages in thread
From: Suravee Suthikulpanit @ 2024-08-29 18:07 UTC (permalink / raw)
To: linux-kernel, iommu
Cc: joro, robin.murphy, vasant.hegde, ubizjak, jgg, jon.grimm,
santosh.shukla, pandoh, kumaranand, Suravee Suthikulpanit
Also, the set_dte_entry() is used to program several DTE fields (e.g.
stage1 table, stage2 table, domain id, and etc.), which is difficult
to keep track with current implementation.
Therefore, separate logic for setting up the GCR3 Table Root Pointer,
GIOV, GV, GLX, and GuestPagingMode into another helper function
set_dte_gcr3_table().
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
drivers/iommu/amd/iommu.c | 121 +++++++++++++++++++++-----------------
1 file changed, 67 insertions(+), 54 deletions(-)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 93bca5c68bca..a24986c2478b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1905,16 +1905,56 @@ int amd_iommu_clear_gcr3(struct iommu_dev_data *dev_data, ioasid_t pasid)
return ret;
}
+static void set_dte_gcr3_table(struct amd_iommu *iommu,
+ struct iommu_dev_data *dev_data,
+ struct dev_table_entry *target)
+{
+ struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
+ u64 tmp, gcr3;
+
+ if (!gcr3_info->gcr3_tbl)
+ return;
+
+ pr_debug("%s: devid=%#x, glx=%#x, gcr3_tbl=%#llx\n",
+ __func__, dev_data->devid, gcr3_info->glx,
+ (unsigned long long)gcr3_info->gcr3_tbl);
+
+ tmp = gcr3_info->glx;
+ target->data[0] |= (tmp & DTE_GLX_MASK) << DTE_GLX_SHIFT;
+ if (pdom_is_v2_pgtbl_mode(dev_data->domain))
+ target->data[0] |= DTE_FLAG_GIOV;
+ target->data[0] |= DTE_FLAG_GV;
+
+ /* First mask out possible old values for GCR3 table */
+ tmp = DTE_GCR3_VAL_A(~0ULL) << DTE_GCR3_SHIFT_A;
+ target->data[0] &= ~tmp;
+ tmp = DTE_GCR3_VAL_B(~0ULL) << DTE_GCR3_SHIFT_B;
+ tmp |= DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
+ target->data[1] &= ~tmp;
+
+ gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);
+
+ /* Encode GCR3 table into DTE */
+ tmp = DTE_GCR3_VAL_A(gcr3) << DTE_GCR3_SHIFT_A;
+ target->data[0] |= tmp;
+ tmp = DTE_GCR3_VAL_B(gcr3) << DTE_GCR3_SHIFT_B;
+ tmp |= DTE_GCR3_VAL_C(gcr3) << DTE_GCR3_SHIFT_C;
+ target->data[1] |= tmp;
+
+ /* Mask out old values for GuestPagingMode */
+ target->data[2] &= ~(0x3ULL << DTE_GPT_LEVEL_SHIFT);
+ /* Guest page table can only support 4 and 5 levels */
+ if (amd_iommu_gpt_level == PAGE_MODE_5_LEVEL)
+ target->data[2] |= ((u64)GUEST_PGTABLE_5_LEVEL << DTE_GPT_LEVEL_SHIFT);
+}
+
static void set_dte_entry(struct amd_iommu *iommu,
struct iommu_dev_data *dev_data)
{
- u64 pte_root = 0;
- u64 flags = 0;
- u32 old_domid;
- u16 devid = dev_data->devid;
u16 domid;
+ u32 old_domid;
+ struct dev_table_entry new;
struct protection_domain *domain = dev_data->domain;
- struct dev_table_entry *dev_table = get_dev_table(iommu);
struct gcr3_tbl_info *gcr3_info = &dev_data->gcr3_info;
if (gcr3_info && gcr3_info->gcr3_tbl)
@@ -1922,73 +1962,46 @@ static void set_dte_entry(struct amd_iommu *iommu,
else
domid = domain->id;
+ /*
+ * Need to preserve the certain fields in DTE because it contains
+ * interrupt-remapping and other settings, which might be
+ * programmed earlier by other code.
+ */
+ get_dte256(iommu, dev_data, &new);
+
if (domain->iop.mode != PAGE_MODE_NONE)
- pte_root = iommu_virt_to_phys(domain->iop.root);
+ new.data[0] = iommu_virt_to_phys(domain->iop.root);
- pte_root |= (domain->iop.mode & DEV_ENTRY_MODE_MASK)
+ new.data[0] |= (domain->iop.mode & DEV_ENTRY_MODE_MASK)
<< DEV_ENTRY_MODE_SHIFT;
- pte_root |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V;
+ new.data[0] |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_V;
/*
* When SNP is enabled, Only set TV bit when IOMMU
* page translation is in use.
*/
if (!amd_iommu_snp_en || (domid != 0))
- pte_root |= DTE_FLAG_TV;
-
- flags = dev_table[devid].data[1];
-
- if (dev_data->ats_enabled)
- flags |= DTE_FLAG_IOTLB;
+ new.data[0] |= DTE_FLAG_TV;
if (dev_data->ppr)
- pte_root |= 1ULL << DEV_ENTRY_PPR;
+ new.data[0] |= 1ULL << DEV_ENTRY_PPR;
if (domain->dirty_tracking)
- pte_root |= DTE_FLAG_HAD;
-
- if (gcr3_info && gcr3_info->gcr3_tbl) {
- u64 gcr3 = iommu_virt_to_phys(gcr3_info->gcr3_tbl);
- u64 glx = gcr3_info->glx;
- u64 tmp;
-
- pte_root |= DTE_FLAG_GV;
- pte_root |= (glx & DTE_GLX_MASK) << DTE_GLX_SHIFT;
-
- /* First mask out possible old values for GCR3 table */
- tmp = DTE_GCR3_VAL_B(~0ULL) << DTE_GCR3_SHIFT_B;
- flags &= ~tmp;
+ new.data[0] |= DTE_FLAG_HAD;
- tmp = DTE_GCR3_VAL_C(~0ULL) << DTE_GCR3_SHIFT_C;
- flags &= ~tmp;
-
- /* Encode GCR3 table into DTE */
- tmp = DTE_GCR3_VAL_A(gcr3) << DTE_GCR3_SHIFT_A;
- pte_root |= tmp;
-
- tmp = DTE_GCR3_VAL_B(gcr3) << DTE_GCR3_SHIFT_B;
- flags |= tmp;
-
- tmp = DTE_GCR3_VAL_C(gcr3) << DTE_GCR3_SHIFT_C;
- flags |= tmp;
-
- if (amd_iommu_gpt_level == PAGE_MODE_5_LEVEL) {
- dev_table[devid].data[2] |=
- ((u64)GUEST_PGTABLE_5_LEVEL << DTE_GPT_LEVEL_SHIFT);
- }
+ if (dev_data->ats_enabled)
+ new.data[1] |= DTE_FLAG_IOTLB;
+ else
+ new.data[1] &= ~DTE_FLAG_IOTLB;
- /* GIOV is supported with V2 page table mode only */
- if (pdom_is_v2_pgtbl_mode(domain))
- pte_root |= DTE_FLAG_GIOV;
- }
+ old_domid = new.data[1] & DEV_DOMID_MASK;
+ new.data[1] &= ~DEV_DOMID_MASK;
+ new.data[1] |= domid;
- flags &= ~DEV_DOMID_MASK;
- flags |= domid;
+ set_dte_gcr3_table(iommu, dev_data, &new);
- old_domid = dev_table[devid].data[1] & DEV_DOMID_MASK;
- dev_table[devid].data[1] = flags;
- dev_table[devid].data[0] = pte_root;
+ update_dte256(iommu, dev_data, &new);
/*
* A kdump kernel might be replacing a domain ID that was copied from
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry()
2024-08-29 18:07 [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE Suravee Suthikulpanit
` (3 preceding siblings ...)
2024-08-29 18:07 ` [PATCH v2 4/5] iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers Suravee Suthikulpanit
@ 2024-08-29 18:07 ` Suravee Suthikulpanit
2024-08-29 19:40 ` Jason Gunthorpe
4 siblings, 1 reply; 16+ messages in thread
From: Suravee Suthikulpanit @ 2024-08-29 18:07 UTC (permalink / raw)
To: linux-kernel, iommu
Cc: joro, robin.murphy, vasant.hegde, ubizjak, jgg, jon.grimm,
santosh.shukla, pandoh, kumaranand, Suravee Suthikulpanit
Interrupt-remapping-related fields are in the top 128-bit of the Device
Table Entry (DTE), which should be updated using 128-bit write based on the
AMD I/O Virtualization Techonology (IOMMU) Specification.
Therefore, modify set_dte_irq_entry() to use 128-bit cmpxchg. Also, use
struct dev_data->dte_sem to synchronize DTE access.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
drivers/iommu/amd/iommu.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a24986c2478b..4eb53bd40487 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3020,17 +3020,24 @@ static void iommu_flush_irt_and_complete(struct amd_iommu *iommu, u16 devid)
static void set_dte_irq_entry(struct amd_iommu *iommu, u16 devid,
struct irq_remap_table *table)
{
- u64 dte;
- struct dev_table_entry *dev_table = get_dev_table(iommu);
+ u128 new, old;
+ struct dev_table_entry *dte = &get_dev_table(iommu)[devid];
+ struct iommu_dev_data *dev_data = search_dev_data(iommu, devid);
+
+ if (dev_data)
+ down_write(&dev_data->dte_sem);
+
+ old = new = dte->data128[1];
+ new &= ~DTE_IRQ_PHYS_ADDR_MASK;
+ new |= iommu_virt_to_phys(table->table);
+ new |= DTE_IRQ_REMAP_INTCTL;
+ new |= DTE_INTTABLEN;
+ new |= DTE_IRQ_REMAP_ENABLE;
- dte = dev_table[devid].data[2];
- dte &= ~DTE_IRQ_PHYS_ADDR_MASK;
- dte |= iommu_virt_to_phys(table->table);
- dte |= DTE_IRQ_REMAP_INTCTL;
- dte |= DTE_INTTABLEN;
- dte |= DTE_IRQ_REMAP_ENABLE;
+ WARN_ON(!try_cmpxchg128(&dte->data128[1], &old, new));
- dev_table[devid].data[2] = dte;
+ if (dev_data)
+ up_write(&dev_data->dte_sem);
}
static struct irq_remap_table *get_irq_table(struct amd_iommu *iommu, u16 devid)
--
2.34.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE
2024-08-29 18:07 ` [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE Suravee Suthikulpanit
@ 2024-08-29 19:28 ` Jason Gunthorpe
2024-09-05 17:54 ` Suthikulpanit, Suravee
0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2024-08-29 19:28 UTC (permalink / raw)
To: Suravee Suthikulpanit
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On Thu, Aug 29, 2024 at 06:07:24PM +0000, Suravee Suthikulpanit wrote:
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 994ed02842b9..93bca5c68bca 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -85,6 +85,47 @@ static void set_dte_entry(struct amd_iommu *iommu,
> *
> ****************************************************************************/
>
> +static void update_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
> + struct dev_table_entry *new)
> +{
> + struct dev_table_entry *dev_table = get_dev_table(iommu);
> + struct dev_table_entry *ptr = &dev_table[dev_data->devid];
> + struct dev_table_entry old;
> + u128 tmp;
> +
> + down_write(&dev_data->dte_sem);
This locking is too narrow, you need the critical region to span from
the get_dte256() till the update_dte256() because the get is
retrieving the value written by set_dte_irq_entry(), and it must not
change while the new DTE is worked on.
I suggest you copy the IRQ data here in this function under the lock
from old to new and then store it so it is always fresh.
Ideally you would remove get_dte256() because the driver *really*
shouldn't be changing the DTE in some way that already assumes
something is in the DTE (for instance my remarks on the nesting work)
Really the only reason to read the DTE is the get the IRQ data..
> + old.data128[0] = ptr->data128[0];
> + old.data128[1] = ptr->data128[1];
> +
> + tmp = cmpxchg128(&ptr->data128[1], old.data128[1], new->data128[1]);
> + if (tmp == old.data128[1]) {
> + if (!try_cmpxchg128(&ptr->data128[0], &old.data128[0], new->data128[0])) {
> + /* Restore hi 128-bit */
> + cmpxchg128(&ptr->data128[1], new->data128[1], tmp);
I don't think you should restore, this should reflect a locking error
but we still need to move forward and put some kind of correct
data.. The code can't go backwards so it should try to move forwards..
On ordering, I don't know, is this OK?
If you are leaving/entering nesting mode I think you have to write the
[2] value in the right sequence, you don't want to have the viommu
enabled unless the host page table is setup properly. So [2] is
written last when enabling, and first when disabling. Flushes required
after each write to ensure the HW doesn't see a cross-128 word bit
tear.
GuestPagingMode also has to be sequenced correctly, the GCR3 table
pointer should be invalid when it is changed, meaning you have to
write it and flush before storing the GCR3 table, and the reverse to
undo it.
The ordering, including when DTE flushes are needed, is pretty
hard. This is much simpler than, say, ARM, so I think you could open
code it, but it should be a pretty sizable bit of logic to figure out
what to do.
> +static void get_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
> + struct dev_table_entry *dte)
> +{
> + struct dev_table_entry *ptr;
> + struct dev_table_entry *dev_table = get_dev_table(iommu);
> +
> + ptr = &dev_table[dev_data->devid];
> +
> + down_read(&dev_data->dte_sem);
> + dte->data128[0] = ptr->data128[0];
> + dte->data128[1] = ptr->data128[1];
> + up_read(&dev_data->dte_sem);
I don't think you need a rwsem either. As above, you shouldn't be
reading anyhow to build a DTE, and you can't allow the interrupt
update to run regardless, so a simple spinlock would be sufficient and
faster, I think.
> @@ -2681,16 +2732,16 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
> }
>
> list_for_each_entry(dev_data, &pdomain->dev_list, list) {
> - iommu = get_amd_iommu_from_dev_data(dev_data);
> + struct dev_table_entry dte;
>
> - dev_table = get_dev_table(iommu);
> - pte_root = dev_table[dev_data->devid].data[0];
> + iommu = get_amd_iommu_from_dev_data(dev_data);
> + get_dte256(iommu, dev_data, &dte);
>
> - pte_root = (enable ? pte_root | DTE_FLAG_HAD :
> - pte_root & ~DTE_FLAG_HAD);
> + dte.data[0] = (enable ? dte.data[0] | DTE_FLAG_HAD :
> + dte.data[0] & ~DTE_FLAG_HAD);
>
And this doesn't need all the logic just to flip one bit in a single
64bit quantity..
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE)
2024-08-29 18:07 ` [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE) Suravee Suthikulpanit
@ 2024-08-29 19:34 ` Jason Gunthorpe
2024-09-05 6:20 ` Suthikulpanit, Suravee
0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2024-08-29 19:34 UTC (permalink / raw)
To: Suravee Suthikulpanit
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On Thu, Aug 29, 2024 at 06:07:23PM +0000, Suravee Suthikulpanit wrote:
> -static void clear_dte_entry(struct amd_iommu *iommu, u16 devid)
> +static void clear_dte_entry(struct amd_iommu *iommu, struct iommu_dev_data *dev_data)
> {
> + u16 devid = dev_data->devid;
> struct dev_table_entry *dev_table = get_dev_table(iommu);
>
> + down_write(&dev_data->dte_sem);
> +
> /* remove entry from the device table seen by the hardware */
> dev_table[devid].data[0] = DTE_FLAG_V;
The rest of this function doesn't seem very good either:
/* remove entry from the device table seen by the hardware */
dev_table[devid].data[0] = DTE_FLAG_V;
if (!amd_iommu_snp_en)
dev_table[devid].data[0] |= DTE_FLAG_TV;
dev_table[devid].data[1] &= DTE_FLAG_MASK;
This should also cmpxchg the whole 128 bit quantity, you don't want
the HW to see a tear on HostPageTableRootPtr.
Which is back to my other point. Don't edit the DTE in place.
There is no such thing as 'clear' in the iommu domain API. The DTE is
either PAGING, BLOCKED or IDENTITY, and any write to it should be
writing an entire DTE for that target.
I guess clear is actually trying to set the DTE to BLOCKING?
Also no need to get the lock here because you don't touch 128 bit
quanta [1] which holds the IRQ stuff that is racey. This is already
locked by the iommu core group lock.
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry()
2024-08-29 18:07 ` [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry() Suravee Suthikulpanit
@ 2024-08-29 19:40 ` Jason Gunthorpe
2024-09-05 5:32 ` Suthikulpanit, Suravee
0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2024-08-29 19:40 UTC (permalink / raw)
To: Suravee Suthikulpanit
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On Thu, Aug 29, 2024 at 06:07:26PM +0000, Suravee Suthikulpanit wrote:
> Interrupt-remapping-related fields are in the top 128-bit of the Device
> Table Entry (DTE), which should be updated using 128-bit write based on the
> AMD I/O Virtualization Techonology (IOMMU) Specification.
>
> Therefore, modify set_dte_irq_entry() to use 128-bit cmpxchg. Also, use
> struct dev_data->dte_sem to synchronize DTE access.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> ---
> drivers/iommu/amd/iommu.c | 25 ++++++++++++++++---------
> 1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index a24986c2478b..4eb53bd40487 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3020,17 +3020,24 @@ static void iommu_flush_irt_and_complete(struct amd_iommu *iommu, u16 devid)
> static void set_dte_irq_entry(struct amd_iommu *iommu, u16 devid,
> struct irq_remap_table *table)
> {
> - u64 dte;
> - struct dev_table_entry *dev_table = get_dev_table(iommu);
> + u128 new, old;
> + struct dev_table_entry *dte = &get_dev_table(iommu)[devid];
> + struct iommu_dev_data *dev_data = search_dev_data(iommu, devid);
> +
> + if (dev_data)
> + down_write(&dev_data->dte_sem);
> +
> + old = new = dte->data128[1];
> + new &= ~DTE_IRQ_PHYS_ADDR_MASK;
> + new |= iommu_virt_to_phys(table->table);
> + new |= DTE_IRQ_REMAP_INTCTL;
> + new |= DTE_INTTABLEN;
> + new |= DTE_IRQ_REMAP_ENABLE;
>
> - dte = dev_table[devid].data[2];
> - dte &= ~DTE_IRQ_PHYS_ADDR_MASK;
> - dte |= iommu_virt_to_phys(table->table);
> - dte |= DTE_IRQ_REMAP_INTCTL;
> - dte |= DTE_INTTABLEN;
> - dte |= DTE_IRQ_REMAP_ENABLE;
> + WARN_ON(!try_cmpxchg128(&dte->data128[1], &old, new));
This probably doesn't need cmpxchg because it is only touching 64 bit
quanta [2], just a lock is good enough and avoids the "what to do if
cmpxchg fails" question.
> - dev_table[devid].data[2] = dte;
But this should be
WRITE_ONCE(dev_table[devid].data[2], dte);
Beaucse it is writing to memory read back by HW.
Required for all the DTE touches everywhere.
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry()
2024-08-29 19:40 ` Jason Gunthorpe
@ 2024-09-05 5:32 ` Suthikulpanit, Suravee
0 siblings, 0 replies; 16+ messages in thread
From: Suthikulpanit, Suravee @ 2024-09-05 5:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
Hi
On 8/30/2024 2:40 AM, Jason Gunthorpe wrote:
> On Thu, Aug 29, 2024 at 06:07:26PM +0000, Suravee Suthikulpanit wrote:
>> Interrupt-remapping-related fields are in the top 128-bit of the Device
>> Table Entry (DTE), which should be updated using 128-bit write based on the
>> AMD I/O Virtualization Techonology (IOMMU) Specification.
>>
>> Therefore, modify set_dte_irq_entry() to use 128-bit cmpxchg. Also, use
>> struct dev_data->dte_sem to synchronize DTE access.
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>> ---
>> drivers/iommu/amd/iommu.c | 25 ++++++++++++++++---------
>> 1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index a24986c2478b..4eb53bd40487 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -3020,17 +3020,24 @@ static void iommu_flush_irt_and_complete(struct amd_iommu *iommu, u16 devid)
>> static void set_dte_irq_entry(struct amd_iommu *iommu, u16 devid,
>> struct irq_remap_table *table)
>> {
>> - u64 dte;
>> - struct dev_table_entry *dev_table = get_dev_table(iommu);
>> + u128 new, old;
>> + struct dev_table_entry *dte = &get_dev_table(iommu)[devid];
>> + struct iommu_dev_data *dev_data = search_dev_data(iommu, devid);
>> +
>> + if (dev_data)
>> + down_write(&dev_data->dte_sem);
>> +
>> + old = new = dte->data128[1];
>> + new &= ~DTE_IRQ_PHYS_ADDR_MASK;
>> + new |= iommu_virt_to_phys(table->table);
>> + new |= DTE_IRQ_REMAP_INTCTL;
>> + new |= DTE_INTTABLEN;
>> + new |= DTE_IRQ_REMAP_ENABLE;
>>
>> - dte = dev_table[devid].data[2];
>> - dte &= ~DTE_IRQ_PHYS_ADDR_MASK;
>> - dte |= iommu_virt_to_phys(table->table);
>> - dte |= DTE_IRQ_REMAP_INTCTL;
>> - dte |= DTE_INTTABLEN;
>> - dte |= DTE_IRQ_REMAP_ENABLE;
>> + WARN_ON(!try_cmpxchg128(&dte->data128[1], &old, new));
>
> This probably doesn't need cmpxchg because it is only touching 64 bit
> quanta [2], just a lock is good enough and avoids the "what to do if
> cmpxchg fails" question.
Okay
>> - dev_table[devid].data[2] = dte;
>
> But this should be
>
> WRITE_ONCE(dev_table[devid].data[2], dte);
>
> Beaucse it is writing to memory read back by HW.
>
> Required for all the DTE touches everywhere.
Good point.
Thanks,
Suravee
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE)
2024-08-29 19:34 ` Jason Gunthorpe
@ 2024-09-05 6:20 ` Suthikulpanit, Suravee
2024-09-05 12:15 ` Jason Gunthorpe
0 siblings, 1 reply; 16+ messages in thread
From: Suthikulpanit, Suravee @ 2024-09-05 6:20 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
Hi,
On 8/30/2024 2:34 AM, Jason Gunthorpe wrote:
> On Thu, Aug 29, 2024 at 06:07:23PM +0000, Suravee Suthikulpanit wrote:
>
>> -static void clear_dte_entry(struct amd_iommu *iommu, u16 devid)
>> +static void clear_dte_entry(struct amd_iommu *iommu, struct iommu_dev_data *dev_data)
>> {
>> + u16 devid = dev_data->devid;
>> struct dev_table_entry *dev_table = get_dev_table(iommu);
>>
>> + down_write(&dev_data->dte_sem);
>> +
>> /* remove entry from the device table seen by the hardware */
>> dev_table[devid].data[0] = DTE_FLAG_V;
>
> The rest of this function doesn't seem very good either:
>
> /* remove entry from the device table seen by the hardware */
> dev_table[devid].data[0] = DTE_FLAG_V;
>
> if (!amd_iommu_snp_en)
> dev_table[devid].data[0] |= DTE_FLAG_TV;
>
> dev_table[devid].data[1] &= DTE_FLAG_MASK;
>
> This should also cmpxchg the whole 128 bit quantity, you don't want
> the HW to see a tear on HostPageTableRootPtr.
>
> Which is back to my other point. Don't edit the DTE in place.
Thanks for pointing this out. I missed the detail in this function while
going through the driver to update the code, which updates live DTE.
I'll change this.
> There is no such thing as 'clear' in the iommu domain API. The DTE is
> either PAGING, BLOCKED or IDENTITY, and any write to it should be
> writing an entire DTE for that target.
>
> I guess clear is actually trying to set the DTE to BLOCKING?
Yes, it's called from blocked_domain->attach_dev() and when doing
amd_iommu_detach_device().
> Also no need to get the lock here because you don't touch 128 bit
> quanta [1] which holds the IRQ stuff that is racey. This is already
> locked by the iommu core group lock.
Actually, we Need to preserve DTE[96:106] because certain fields are
programmed using value in IVRS table from early init phase. So, to avoid
try_cmpxchg128 failure, I need to spin_lock on dte_lock across the
get_dte256() and update_dte256()
Thanks,
Suravee
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE)
2024-09-05 6:20 ` Suthikulpanit, Suravee
@ 2024-09-05 12:15 ` Jason Gunthorpe
0 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2024-09-05 12:15 UTC (permalink / raw)
To: Suthikulpanit, Suravee
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On Thu, Sep 05, 2024 at 01:20:42PM +0700, Suthikulpanit, Suravee wrote:
> > There is no such thing as 'clear' in the iommu domain API. The DTE is
> > either PAGING, BLOCKED or IDENTITY, and any write to it should be
> > writing an entire DTE for that target.
> >
> > I guess clear is actually trying to set the DTE to BLOCKING?
>
> Yes, it's called from blocked_domain->attach_dev() and when doing
> amd_iommu_detach_device().
Could give it a better name then too
> > Also no need to get the lock here because you don't touch 128 bit
> > quanta [1] which holds the IRQ stuff that is racey. This is already
> > locked by the iommu core group lock.
>
> Actually, we Need to preserve DTE[96:106] because certain fields are
> programmed using value in IVRS table from early init phase. So, to avoid
> try_cmpxchg128 failure, I need to spin_lock on dte_lock across the
> get_dte256() and update_dte256()
But who is concurrently writing those bits? Isn't it just set at boot
time once and sort of stored in the DTE, never changing?
There are two locking regimes here, all the normal DTE touches from
iomm ops are locked by the group mutex and they are non-concurrent
already.
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE
2024-08-29 19:28 ` Jason Gunthorpe
@ 2024-09-05 17:54 ` Suthikulpanit, Suravee
2024-09-05 18:21 ` Jason Gunthorpe
0 siblings, 1 reply; 16+ messages in thread
From: Suthikulpanit, Suravee @ 2024-09-05 17:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
Hi,
On 8/30/2024 2:28 AM, Jason Gunthorpe wrote:
> On Thu, Aug 29, 2024 at 06:07:24PM +0000, Suravee Suthikulpanit wrote:
>
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 994ed02842b9..93bca5c68bca 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -85,6 +85,47 @@ static void set_dte_entry(struct amd_iommu *iommu,
>> *
>> ****************************************************************************/
>>
>> +static void update_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
>> + struct dev_table_entry *new)
>> +{
>> + struct dev_table_entry *dev_table = get_dev_table(iommu);
>> + struct dev_table_entry *ptr = &dev_table[dev_data->devid];
>> + struct dev_table_entry old;
>> + u128 tmp;
>> +
>> + down_write(&dev_data->dte_sem);
>
> This locking is too narrow, you need the critical region to span from
> the get_dte256() till the update_dte256() because the get is
> retrieving the value written by set_dte_irq_entry(), and it must not
> change while the new DTE is worked on.
Ok.
> I suggest you copy the IRQ data here in this function under the lock
> from old to new and then store it so it is always fresh.
>
> Ideally you would remove get_dte256() because the driver *really*
> shouldn't be changing the DTE in some way that already assumes
> something is in the DTE (for instance my remarks on the nesting work)
>
> Really the only reason to read the DTE is the get the IRQ data..
I plan to use get_dte256() helper function to extract DTE for various
purposes. Getting the IRQ data is only one use case. There are other
fields, which are programmed early in the driver init phrase (i.e.
DTE[96:106]).
>> + old.data128[0] = ptr->data128[0];
>> + old.data128[1] = ptr->data128[1];
>> +
>> + tmp = cmpxchg128(&ptr->data128[1], old.data128[1], new->data128[1]);
>> + if (tmp == old.data128[1]) {
>> + if (!try_cmpxchg128(&ptr->data128[0], &old.data128[0], new->data128[0])) {
>> + /* Restore hi 128-bit */
>> + cmpxchg128(&ptr->data128[1], new->data128[1], tmp);
>
> I don't think you should restore, this should reflect a locking error
> but we still need to move forward and put some kind of correct
> data.. The code can't go backwards so it should try to move forwards..
In case of error, what if we pr_warn and put the device in blocking mode
since we need to prevent malicious DMAs.
> On ordering, I don't know, is this OK?
>
> If you are leaving/entering nesting mode I think you have to write the
> [2] value in the right sequence, you don't want to have the viommu
> enabled unless the host page table is setup properly. So [2] is
> written last when enabling, and first when disabling. Flushes required
> after each write to ensure the HW doesn't see a cross-128 word bit
> tear.
> > GuestPagingMode also has to be sequenced correctly, the GCR3 table
> pointer should be invalid when it is changed, meaning you have to
> write it and flush before storing the GCR3 table, and the reverse to
> undo it.
>
> The ordering, including when DTE flushes are needed, is pretty
> hard. This is much simpler than, say, ARM, so I think you could open
> code it, but it should be a pretty sizable bit of logic to figure out
> what to do.
IOMMU hardware do not do partial interpret of the DTE and SW ensure DTE
flush after updating the DTE. Therefore, ordering should not be of a
concern here as long as the driver correctly program the entry.
>> +static void get_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
>> + struct dev_table_entry *dte)
>> +{
>> + struct dev_table_entry *ptr;
>> + struct dev_table_entry *dev_table = get_dev_table(iommu);
>> +
>> + ptr = &dev_table[dev_data->devid];
>> +
>> + down_read(&dev_data->dte_sem);
>> + dte->data128[0] = ptr->data128[0];
>> + dte->data128[1] = ptr->data128[1];
>> + up_read(&dev_data->dte_sem);
>
> I don't think you need a rwsem either. As above, you shouldn't be
> reading anyhow to build a DTE, and you can't allow the interrupt
> update to run regardless, so a simple spinlock would be sufficient and
> faster, I think.
Ok.
>> @@ -2681,16 +2732,16 @@ static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain,
>> }
>>
>> list_for_each_entry(dev_data, &pdomain->dev_list, list) {
>> - iommu = get_amd_iommu_from_dev_data(dev_data);
>> + struct dev_table_entry dte;
>>
>> - dev_table = get_dev_table(iommu);
>> - pte_root = dev_table[dev_data->devid].data[0];
>> + iommu = get_amd_iommu_from_dev_data(dev_data);
>> + get_dte256(iommu, dev_data, &dte);
>>
>> - pte_root = (enable ? pte_root | DTE_FLAG_HAD :
>> - pte_root & ~DTE_FLAG_HAD);
>> + dte.data[0] = (enable ? dte.data[0] | DTE_FLAG_HAD :
>> + dte.data[0] & ~DTE_FLAG_HAD);
>>
>
> And this doesn't need all the logic just to flip one bit in a single
> 64bit quantity..
Ok
Thanks,
Suravee
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE
2024-09-05 17:54 ` Suthikulpanit, Suravee
@ 2024-09-05 18:21 ` Jason Gunthorpe
2024-09-06 14:08 ` Suthikulpanit, Suravee
0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2024-09-05 18:21 UTC (permalink / raw)
To: Suthikulpanit, Suravee
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On Fri, Sep 06, 2024 at 12:54:25AM +0700, Suthikulpanit, Suravee wrote:
> Hi,
>
> On 8/30/2024 2:28 AM, Jason Gunthorpe wrote:
> > On Thu, Aug 29, 2024 at 06:07:24PM +0000, Suravee Suthikulpanit wrote:
> >
> > > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > > index 994ed02842b9..93bca5c68bca 100644
> > > --- a/drivers/iommu/amd/iommu.c
> > > +++ b/drivers/iommu/amd/iommu.c
> > > @@ -85,6 +85,47 @@ static void set_dte_entry(struct amd_iommu *iommu,
> > > *
> > > ****************************************************************************/
> > > +static void update_dte256(struct amd_iommu *iommu, struct iommu_dev_data *dev_data,
> > > + struct dev_table_entry *new)
> > > +{
> > > + struct dev_table_entry *dev_table = get_dev_table(iommu);
> > > + struct dev_table_entry *ptr = &dev_table[dev_data->devid];
> > > + struct dev_table_entry old;
> > > + u128 tmp;
> > > +
> > > + down_write(&dev_data->dte_sem);
> >
> > This locking is too narrow, you need the critical region to span from
> > the get_dte256() till the update_dte256() because the get is
> > retrieving the value written by set_dte_irq_entry(), and it must not
> > change while the new DTE is worked on.
>
> Ok.
>
> > I suggest you copy the IRQ data here in this function under the lock
> > from old to new and then store it so it is always fresh.
> >
> > Ideally you would remove get_dte256() because the driver *really*
> > shouldn't be changing the DTE in some way that already assumes
> > something is in the DTE (for instance my remarks on the nesting work)
> >
> > Really the only reason to read the DTE is the get the IRQ data..
>
> I plan to use get_dte256() helper function to extract DTE for various
> purposes. Getting the IRQ data is only one use case. There are other fields,
> which are programmed early in the driver init phrase (i.e. DTE[96:106]).
Sure, a model where you have specific 'fixed' fields and you
store them in the DTE is logical. You want to target something like
struct dte new_dte = init_dte(..)
new_dte |= [....]
program_dte()
Where init_dte could read out fixed bits from the existing DTE
> > I don't think you should restore, this should reflect a locking error
> > but we still need to move forward and put some kind of correct
> > data.. The code can't go backwards so it should try to move forwards..
>
> In case of error, what if we pr_warn and put the device in blocking mode
> since we need to prevent malicious DMAs.
IMHO a WARN_ON is fine, and alerts to the possible machine corruption
No need to do blocking, you should have a perfectly valid target DTE
that represents the state the HW is expected to be in. Resolve the
race by making it bin that state and move forwards.
> > On ordering, I don't know, is this OK?
> >
> > If you are leaving/entering nesting mode I think you have to write the
> > [2] value in the right sequence, you don't want to have the viommu
> > enabled unless the host page table is setup properly. So [2] is
> > written last when enabling, and first when disabling. Flushes required
> > after each write to ensure the HW doesn't see a cross-128 word bit
> > tear.
> > > GuestPagingMode also has to be sequenced correctly, the GCR3 table
> > pointer should be invalid when it is changed, meaning you have to
> > write it and flush before storing the GCR3 table, and the reverse to
> > undo it.
> >
> > The ordering, including when DTE flushes are needed, is pretty
> > hard. This is much simpler than, say, ARM, so I think you could open
> > code it, but it should be a pretty sizable bit of logic to figure out
> > what to do.
>
> IOMMU hardware do not do partial interpret of the DTE and SW ensure DTE
> flush after updating the DTE. Therefore, ordering should not be of a concern
> here as long as the driver correctly program the entry.
Even if the IOMMU HW does a perfect 256 bit atomic read you still have
to order the CPU writes correctly. It just means you don't need to
flush.
The guidelines in "2.2.2.2 Making Device Table Entry Changes" make
this clear. The indivudal CPU writes smaller than 256 bits have to be
sequenced right.
This section looks like it was written before translation bits were
placed in the other 128 bit word - it assumes a single 128 bit write
is always sufficient which isn't true anymore.
So you still have the issue of having to decide if you write 128 bit
[0] or [1] first.
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE
2024-09-05 18:21 ` Jason Gunthorpe
@ 2024-09-06 14:08 ` Suthikulpanit, Suravee
2024-09-06 16:01 ` Jason Gunthorpe
0 siblings, 1 reply; 16+ messages in thread
From: Suthikulpanit, Suravee @ 2024-09-06 14:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On 9/6/2024 1:21 AM, Jason Gunthorpe wrote:
>>> I don't think you should restore, this should reflect a locking error
>>> but we still need to move forward and put some kind of correct
>>> data.. The code can't go backwards so it should try to move forwards..
>> In case of error, what if we pr_warn and put the device in blocking mode
>> since we need to prevent malicious DMAs.
> IMHO a WARN_ON is fine, and alerts to the possible machine corruption
>
> No need to do blocking, you should have a perfectly valid target DTE
> that represents the state the HW is expected to be in. Resolve the
> race by making it bin that state and move forwards.
What do you mean by "making it bin that state".
>>> On ordering, I don't know, is this OK?
>>>
>>> If you are leaving/entering nesting mode I think you have to write the
>>> [2] value in the right sequence, you don't want to have the viommu
>>> enabled unless the host page table is setup properly. So [2] is
>>> written last when enabling, and first when disabling. Flushes required
>>> after each write to ensure the HW doesn't see a cross-128 word bit
>>> tear.
>>>> GuestPagingMode also has to be sequenced correctly, the GCR3 table
>>> pointer should be invalid when it is changed, meaning you have to
>>> write it and flush before storing the GCR3 table, and the reverse to
>>> undo it.
>>>
>>> The ordering, including when DTE flushes are needed, is pretty
>>> hard. This is much simpler than, say, ARM, so I think you could open
>>> code it, but it should be a pretty sizable bit of logic to figure out
>>> what to do.
>> IOMMU hardware do not do partial interpret of the DTE and SW ensure DTE
>> flush after updating the DTE. Therefore, ordering should not be of a concern
>> here as long as the driver correctly program the entry.
> Even if the IOMMU HW does a perfect 256 bit atomic read you still have
> to order the CPU writes correctly. It just means you don't need to
> flush.
>
> The guidelines in "2.2.2.2 Making Device Table Entry Changes" make
> this clear. The indivudal CPU writes smaller than 256 bits have to be
> sequenced right.
For the interrupt remapping part, no special step is needed if we can
write do 64-bit write. Similary, for the address translation part, no
special step is needed if we can do 128-bit write.
> This section looks like it was written before translation bits were
> placed in the other 128 bit word - it assumes a single 128 bit write
> is always sufficient which isn't true anymore.
>
> So you still have the issue of having to decide if you write 128 bit
> [0] or [1] first.
The GuestPagingMode bit is in effect when GV=1. So, the higher 128-bit
(which contains GuestPagingMode bit) should be written first, and follow
by lower 128-bit (which contans GV bit).
Thanks,
Suravee
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE
2024-09-06 14:08 ` Suthikulpanit, Suravee
@ 2024-09-06 16:01 ` Jason Gunthorpe
0 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2024-09-06 16:01 UTC (permalink / raw)
To: Suthikulpanit, Suravee
Cc: linux-kernel, iommu, joro, robin.murphy, vasant.hegde, ubizjak,
jon.grimm, santosh.shukla, pandoh, kumaranand
On Fri, Sep 06, 2024 at 09:08:06PM +0700, Suthikulpanit, Suravee wrote:
> On 9/6/2024 1:21 AM, Jason Gunthorpe wrote:
> > > > I don't think you should restore, this should reflect a locking error
> > > > but we still need to move forward and put some kind of correct
> > > > data.. The code can't go backwards so it should try to move forwards..
> > > In case of error, what if we pr_warn and put the device in blocking mode
> > > since we need to prevent malicious DMAs.
> > IMHO a WARN_ON is fine, and alerts to the possible machine corruption
> > No need to do blocking, you should have a perfectly valid target DTE
> > that represents the state the HW is expected to be in. Resolve the
> > race by making it bin that state and move forwards.
>
> What do you mean by "making it bin that state".
Sorry, "be in that state"
> > The guidelines in "2.2.2.2 Making Device Table Entry Changes" make
> > this clear. The indivudal CPU writes smaller than 256 bits have to be
> > sequenced right.
>
> For the interrupt remapping part, no special step is needed if we can write
> do 64-bit write.
Yes
> Similary, for the address translation part, no special step is
> needed if we can do 128-bit write.
Except for GuestPagingMode, as below.
> > This section looks like it was written before translation bits were
> > placed in the other 128 bit word - it assumes a single 128 bit write
> > is always sufficient which isn't true anymore.
> >
> > So you still have the issue of having to decide if you write 128 bit
> > [0] or [1] first.
>
> The GuestPagingMode bit is in effect when GV=1. So, the higher 128-bit
> (which contains GuestPagingMode bit) should be written first, and follow by
> lower 128-bit (which contans GV bit).
Yes, exactly. That is what I mean by ordering.
When clearing GV=0 you have to do the reverse ordering, write the low
128 then the high.
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-09-06 16:01 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-29 18:07 [PATCH v2 0/5] iommu/amd: Use 128-bit cmpxchg operation to update DTE Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 1/5] iommu/amd: Disable AMD IOMMU if CMPXCHG16B feature is not supported Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 2/5] iommu/amd: Introduce rw_semaphore for Device Table Entry (DTE) Suravee Suthikulpanit
2024-08-29 19:34 ` Jason Gunthorpe
2024-09-05 6:20 ` Suthikulpanit, Suravee
2024-09-05 12:15 ` Jason Gunthorpe
2024-08-29 18:07 ` [PATCH v2 3/5] iommu/amd: Introduce helper functions to access and update 256-bit DTE Suravee Suthikulpanit
2024-08-29 19:28 ` Jason Gunthorpe
2024-09-05 17:54 ` Suthikulpanit, Suravee
2024-09-05 18:21 ` Jason Gunthorpe
2024-09-06 14:08 ` Suthikulpanit, Suravee
2024-09-06 16:01 ` Jason Gunthorpe
2024-08-29 18:07 ` [PATCH v2 4/5] iommu/amd: Modify set_dte_entry() to use 256-bit DTE helpers Suravee Suthikulpanit
2024-08-29 18:07 ` [PATCH v2 5/5] iommu/amd: Use 128-bit cmpxchg in set_dte_irq_entry() Suravee Suthikulpanit
2024-08-29 19:40 ` Jason Gunthorpe
2024-09-05 5:32 ` Suthikulpanit, Suravee
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox