* [PATCH 0/3] iommu/vt-d: Ensure atomicity in context and PASID entry updates
@ 2026-01-13 3:00 Lu Baolu
2026-01-13 3:00 ` [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries Lu Baolu
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Lu Baolu @ 2026-01-13 3:00 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu
This is a follow-up from recent discussions in the iommu community
mailing list [1] [2] regarding potential race conditions in table
entry updates.
The Intel VT-d hardware fetches translation table entries (context
entries and PASID entries) in 128-bit (16-byte) chunks. Currently, the
Linux driver often updates these entries using multiple 64-bit writes.
This creates a race condition where the IOMMU hardware may fetch a
"torn" entry — a mixture of old and new data — during a CPU update. This
can lead to unpredictable hardware behavior, spurious faults, or system
instability.
This series addresses these atomicity issues by implementing 128-bit atomic
updates where possible and following the VT-d specification's ownership
handshake protocol for larger structures.
[1] https://lore.kernel.org/linux-iommu/20251227175728.4358-1-dmaluka@chromium.org/
[2] https://lore.kernel.org/linux-iommu/20260107201800.2486137-1-skhawaja@google.com/
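[Editorial note] The torn-fetch hazard described above can be simulated in
plain C. This is an illustrative userspace sketch only, not driver code; the
struct layout and helper name are invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two-halves layout of a context entry. */
struct ctx_entry { uint64_t lo, hi; };

/*
 * Simulate a hardware fetch landing between the CPU's two 64-bit
 * writes: the result mixes the new 'lo' with the old 'hi'.
 */
static struct ctx_entry torn_fetch(struct ctx_entry old, struct ctx_entry new)
{
	struct ctx_entry table = old;

	table.lo = new.lo;                /* first 64-bit write lands */
	struct ctx_entry fetched = table; /* hardware fetch happens here */
	table.hi = new.hi;                /* second write completes */

	return fetched;                   /* torn: half new, half old */
}
```

A single 128-bit store, as this series implements via cmpxchg16b, closes
exactly this window.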
Lu Baolu (3):
iommu/vt-d: Use 128-bit atomic updates for context entries
iommu/vt-d: Clear Present bit before tearing down PASID entry
iommu/vt-d: Rework hitless PASID entry replacement
drivers/iommu/intel/Kconfig | 2 +-
drivers/iommu/intel/iommu.h | 22 ++++++++--
drivers/iommu/intel/pasid.h | 38 ++++++++++++++++-
drivers/iommu/intel/iommu.c | 30 ++++++-------
drivers/iommu/intel/pasid.c | 84 ++++++++++++++++++++++++++++++-------
5 files changed, 140 insertions(+), 36 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-13 3:00 [PATCH 0/3] iommu/vt-d: Ensure atomicity in context and PASID entry updates Lu Baolu
@ 2026-01-13 3:00 ` Lu Baolu
2026-01-13 19:27 ` Dmytro Maluka
2026-01-14 7:54 ` Tian, Kevin
2026-01-13 3:00 ` [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry Lu Baolu
2026-01-13 3:00 ` [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement Lu Baolu
2 siblings, 2 replies; 39+ messages in thread
From: Lu Baolu @ 2026-01-13 3:00 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu
On Intel IOMMU, device context entries are accessed by hardware in
128-bit chunks. Currently, the driver updates these entries by
programming the 'lo' and 'hi' 64-bit fields individually.
This creates a potential race condition where the IOMMU hardware may fetch
a context entry while the CPU has only completed one of the two 64-bit
writes. This "torn" entry — consisting of half-old and half-new data —
could lead to unpredictable hardware behavior, especially when
transitioning the 'Present' bit or changing translation types.
To ensure the IOMMU hardware always observes a consistent state, use
128-bit atomic updates for context entries. This is achieved by building
context entries on the stack and writing them to the table in a single
operation.
As this relies on arch_cmpxchg128_local(), restrict INTEL_IOMMU
dependencies to X86_64.
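[Editorial note] The build-on-stack-then-publish pattern can be sketched in
userspace C as below. The names are invented, and the plain u128 assignment
here only stands in for the kernel's arch_cmpxchg128_local(); a plain store
is not itself guaranteed atomic in real hardware-facing code:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

struct ctx_entry {
	union {
		struct { uint64_t lo, hi; };
		u128 val128;
	};
};

/*
 * Build the complete entry on the stack first, then publish it to the
 * table slot in one 128-bit operation, so no observer can see a
 * half-written entry. In the driver the final store is done with
 * arch_cmpxchg128_local(); the assignment below only models the idea.
 */
static void ctx_entry_publish(struct ctx_entry *slot, uint64_t lo, uint64_t hi)
{
	struct ctx_entry new = {0};

	new.lo = lo;
	new.hi = hi;
	slot->val128 = new.val128;	/* single 128-bit publish */
}
```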
Fixes: ba39592764ed2 ("Intel IOMMU: Intel IOMMU driver")
Reported-by: Dmytro Maluka <dmaluka@chromium.org>
Closes: https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
drivers/iommu/intel/Kconfig | 2 +-
drivers/iommu/intel/iommu.h | 22 ++++++++++++++++++----
drivers/iommu/intel/iommu.c | 30 +++++++++++++++---------------
drivers/iommu/intel/pasid.c | 18 +++++++++---------
4 files changed, 43 insertions(+), 29 deletions(-)
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 5471f814e073..efda19820f95 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -11,7 +11,7 @@ config DMAR_DEBUG
config INTEL_IOMMU
bool "Support for Intel IOMMU using DMA Remapping Devices"
- depends on PCI_MSI && ACPI && X86
+ depends on PCI_MSI && ACPI && X86_64
select IOMMU_API
select GENERIC_PT
select IOMMU_PT
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 25c5e22096d4..b8999802f401 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -546,6 +546,16 @@ struct pasid_entry;
struct pasid_state_entry;
struct page_req_dsc;
+static __always_inline void intel_iommu_atomic128_set(u128 *ptr, u128 val)
+{
+ /*
+ * Use the cmpxchg16b instruction for 128-bit atomicity. As updates
+ * are serialized by a spinlock, we use the local (unlocked) variant
+ * to avoid unnecessary bus locking overhead.
+ */
+ arch_cmpxchg128_local(ptr, *ptr, val);
+}
+
/*
* 0: Present
* 1-11: Reserved
@@ -569,8 +579,13 @@ struct root_entry {
* 8-23: domain id
*/
struct context_entry {
- u64 lo;
- u64 hi;
+ union {
+ struct {
+ u64 lo;
+ u64 hi;
+ };
+ u128 val128;
+ };
};
struct iommu_domain_info {
@@ -946,8 +961,7 @@ static inline int context_domain_id(struct context_entry *c)
static inline void context_clear_entry(struct context_entry *context)
{
- context->lo = 0;
- context->hi = 0;
+ intel_iommu_atomic128_set(&context->val128, 0);
}
#ifdef CONFIG_INTEL_IOMMU
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 134302fbcd92..d721061ebda2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1147,8 +1147,8 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
domain_lookup_dev_info(domain, iommu, bus, devfn);
u16 did = domain_id_iommu(domain, iommu);
int translation = CONTEXT_TT_MULTI_LEVEL;
+ struct context_entry *context, new = {0};
struct pt_iommu_vtdss_hw_info pt_info;
- struct context_entry *context;
int ret;
if (WARN_ON(!intel_domain_is_ss_paging(domain)))
@@ -1170,19 +1170,19 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
goto out_unlock;
copied_context_tear_down(iommu, context, bus, devfn);
- context_clear_entry(context);
- context_set_domain_id(context, did);
+ context_set_domain_id(&new, did);
if (info && info->ats_supported)
translation = CONTEXT_TT_DEV_IOTLB;
else
translation = CONTEXT_TT_MULTI_LEVEL;
- context_set_address_root(context, pt_info.ssptptr);
- context_set_address_width(context, pt_info.aw);
- context_set_translation_type(context, translation);
- context_set_fault_enable(context);
- context_set_present(context);
+ context_set_address_root(&new, pt_info.ssptptr);
+ context_set_address_width(&new, pt_info.aw);
+ context_set_translation_type(&new, translation);
+ context_set_fault_enable(&new);
+ context_set_present(&new);
+ intel_iommu_atomic128_set(&context->val128, new.val128);
if (!ecap_coherent(iommu->ecap))
clflush_cache_range(context, sizeof(*context));
context_present_cache_flush(iommu, did, bus, devfn);
@@ -3771,8 +3771,8 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
static int context_setup_pass_through(struct device *dev, u8 bus, u8 devfn)
{
struct device_domain_info *info = dev_iommu_priv_get(dev);
+ struct context_entry *context, new = {0};
struct intel_iommu *iommu = info->iommu;
- struct context_entry *context;
spin_lock(&iommu->lock);
context = iommu_context_addr(iommu, bus, devfn, 1);
@@ -3787,17 +3787,17 @@ static int context_setup_pass_through(struct device *dev, u8 bus, u8 devfn)
}
copied_context_tear_down(iommu, context, bus, devfn);
- context_clear_entry(context);
- context_set_domain_id(context, FLPT_DEFAULT_DID);
+ context_set_domain_id(&new, FLPT_DEFAULT_DID);
/*
* In pass through mode, AW must be programmed to indicate the largest
* AGAW value supported by hardware. And ASR is ignored by hardware.
*/
- context_set_address_width(context, iommu->msagaw);
- context_set_translation_type(context, CONTEXT_TT_PASS_THROUGH);
- context_set_fault_enable(context);
- context_set_present(context);
+ context_set_address_width(&new, iommu->msagaw);
+ context_set_translation_type(&new, CONTEXT_TT_PASS_THROUGH);
+ context_set_fault_enable(&new);
+ context_set_present(&new);
+ intel_iommu_atomic128_set(&context->val128, new.val128);
if (!ecap_coherent(iommu->ecap))
clflush_cache_range(context, sizeof(*context));
context_present_cache_flush(iommu, FLPT_DEFAULT_DID, bus, devfn);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..298a39183996 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -978,23 +978,23 @@ static int context_entry_set_pasid_table(struct context_entry *context,
struct device_domain_info *info = dev_iommu_priv_get(dev);
struct pasid_table *table = info->pasid_table;
struct intel_iommu *iommu = info->iommu;
+ struct context_entry new = {0};
unsigned long pds;
- context_clear_entry(context);
-
pds = context_get_sm_pds(table);
- context->lo = (u64)virt_to_phys(table->table) | context_pdts(pds);
- context_set_sm_rid2pasid(context, IOMMU_NO_PASID);
+ new.lo = (u64)virt_to_phys(table->table) | context_pdts(pds);
+ context_set_sm_rid2pasid(&new, IOMMU_NO_PASID);
if (info->ats_supported)
- context_set_sm_dte(context);
+ context_set_sm_dte(&new);
if (info->pasid_supported)
- context_set_pasid(context);
+ context_set_pasid(&new);
if (info->pri_supported)
- context_set_sm_pre(context);
+ context_set_sm_pre(&new);
- context_set_fault_enable(context);
- context_set_present(context);
+ context_set_fault_enable(&new);
+ context_set_present(&new);
+ intel_iommu_atomic128_set(&context->val128, new.val128);
__iommu_flush_cache(iommu, context, sizeof(*context));
return 0;
--
2.43.0
* [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-13 3:00 [PATCH 0/3] iommu/vt-d: Ensure atomicity in context and PASID entry updates Lu Baolu
2026-01-13 3:00 ` [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries Lu Baolu
@ 2026-01-13 3:00 ` Lu Baolu
2026-01-13 19:34 ` Dmytro Maluka
2026-01-14 7:32 ` Tian, Kevin
2026-01-13 3:00 ` [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement Lu Baolu
2 siblings, 2 replies; 39+ messages in thread
From: Lu Baolu @ 2026-01-13 3:00 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu
The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
bytes). When tearing down an entry, the current implementation zeros the
entire 64-byte structure immediately.
However, the IOMMU hardware may fetch these 64 bytes using multiple
internal transactions (e.g., four 128-bit bursts). If a hardware fetch
occurs simultaneously with the CPU zeroing the entry, the hardware could
observe a "torn" entry — where some chunks are zeroed and others still
contain old data — leading to unpredictable behavior or spurious faults.
Follow the "Guidance to Software for Invalidations" in the VT-d spec
(Section 6.5.3.3) by implementing a proper ownership handshake:
1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
hardware that the entry is no longer valid.
2. Execute the required invalidation sequence (PASID cache, IOTLB, and
Device-TLB flush) to ensure the hardware has released all cached
references to the entry.
3. Only after the flushes are complete, zero out the remaining fields of
the PASID entry.
Additionally, add an explicit clflush in intel_pasid_clear_entry() to
ensure that the cleared entry is visible to the IOMMU on systems where
memory coherency (ecap_coherent) is not supported.
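[Editorial note] The three-step ownership handshake can be sketched as a
userspace model with stubbed flush helpers; all names here are invented for
illustration and do not match the driver's actual functions:

```c
#include <stdint.h>

struct pasid_entry { uint64_t val[8]; };

/* Stubs standing in for the driver's invalidation helpers. */
static int flushes_done;
static void flush_pasid_cache(void) { flushes_done++; }
static void flush_iotlb(void)       { flushes_done++; }
static void flush_devtlb(void)      { flushes_done++; }

static void pasid_entry_teardown(struct pasid_entry *pe)
{
	int i;

	/* 1. Clear only the Present bit (bit 0 of val[0]). */
	pe->val[0] &= ~1ULL;

	/* 2. Invalidation sequence: hardware drops cached references. */
	flush_pasid_cache();
	flush_iotlb();
	flush_devtlb();

	/* 3. Only now is the entry software-owned; zero the rest. */
	for (i = 0; i < 8; i++)
		pe->val[i] = 0;
}
```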
Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
drivers/iommu/intel/pasid.h | 12 ++++++++++++
drivers/iommu/intel/pasid.c | 9 +++++++--
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index b4c85242dc79..35de1d77355f 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
pasid_set_bits(&pe->val[0], 1 << 0, 1);
}
+/*
+ * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
+ * This initiates the transition of the entry's ownership from hardware
+ * to software. The caller is responsible for fulfilling the invalidation
+ * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
+ * Software for Invalidations).
+ */
+static inline void pasid_clear_present(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[0], 1 << 0, 0);
+}
+
/*
* Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
* entry.
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 298a39183996..4f36138448d8 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -178,7 +178,8 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
* Interfaces for PASID table entry manipulation:
*/
static void
-intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
+intel_pasid_clear_entry(struct intel_iommu *iommu, struct device *dev,
+ u32 pasid, bool fault_ignore)
{
struct pasid_entry *pe;
@@ -190,6 +191,9 @@ intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
pasid_clear_entry_with_fpd(pe);
else
pasid_clear_entry(pe);
+
+ if (!ecap_coherent(iommu->ecap))
+ clflush_cache_range(pe, sizeof(*pe));
}
static void
@@ -272,7 +276,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
did = pasid_get_domain_id(pte);
pgtt = pasid_pte_get_pgtt(pte);
- intel_pasid_clear_entry(dev, pasid, fault_ignore);
+ pasid_clear_present(pte);
spin_unlock(&iommu->lock);
if (!ecap_coherent(iommu->ecap))
@@ -286,6 +290,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
devtlb_invalidation_with_pasid(iommu, dev, pasid);
+ intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
if (!fault_ignore)
intel_iommu_drain_pasid_prq(dev, pasid);
}
--
2.43.0
* [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 3:00 [PATCH 0/3] iommu/vt-d: Ensure atomicity in context and PASID entry updates Lu Baolu
2026-01-13 3:00 ` [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries Lu Baolu
2026-01-13 3:00 ` [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry Lu Baolu
@ 2026-01-13 3:00 ` Lu Baolu
2026-01-13 15:05 ` Jason Gunthorpe
` (2 more replies)
2 siblings, 3 replies; 39+ messages in thread
From: Lu Baolu @ 2026-01-13 3:00 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu
The Intel VT-d PASID table entry is 512 bits (64 bytes). Because the
hardware may fetch this entry in multiple 128-bit chunks, updating the
entire entry while it is active (P=1) risks a "torn" read where the
hardware observes an inconsistent state.
However, certain updates (e.g., changing page table pointers while
keeping the translation type and domain ID the same) can be performed
hitlessly. This is possible if the update is limited to a single
128-bit chunk while the other chunks remain stable.
Introduce a hitless replacement mechanism for PASID entries:
- Update 'struct pasid_entry' with a union to support 128-bit
access via the newly added val128[4] array.
- Add pasid_support_hitless_replace() to determine if a transition
between an old and new entry is safe to perform atomically.
- For First-level/Nested translations: The first 128 bits (chunk 0)
must remain identical; chunk 1 is updated atomically.
- For Second-level/Pass-through: The second 128 bits (chunk 1)
must remain identical; chunk 0 is updated atomically.
- If hitless replacement is supported, use intel_iommu_atomic128_set()
to commit the change in a single 16-byte burst.
- If the changes are too extensive to be hitless, fall back to the
safe "tear down and re-setup" flow (clear present -> flush -> setup).
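[Editorial note] The chunk-stability rule above can be sketched as a
standalone check. This is a simplified userspace model covering only two of
the four PGTT types, with invented names:

```c
#include <stdbool.h>
#include <stdint.h>

typedef unsigned __int128 u128;

struct pasid_entry {
	union {
		uint64_t val[8];
		u128 val128[4];
	};
};

enum { PGTT_FL_ONLY, PGTT_SL_ONLY };	/* simplified stand-ins */

/*
 * A replacement is hitless only if the chunk that must stay stable is
 * identical: chunk 0 for first-level (chunk 1 is then updated
 * atomically), chunk 1 for second-level (chunk 0 is updated).
 */
static bool hitless_ok(const struct pasid_entry *old,
		       const struct pasid_entry *new, int type)
{
	if (type == PGTT_FL_ONLY)
		return old->val128[0] == new->val128[0];
	return old->val128[1] == new->val128[1];
}
```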
Fixes: 7543ee63e811 ("iommu/vt-d: Add pasid replace helpers")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
drivers/iommu/intel/pasid.h | 26 ++++++++++++++++-
drivers/iommu/intel/pasid.c | 57 ++++++++++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 35de1d77355f..b569e2828a8b 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -37,7 +37,10 @@ struct pasid_dir_entry {
};
struct pasid_entry {
- u64 val[8];
+ union {
+ u64 val[8];
+ u128 val128[4];
+ };
};
#define PASID_ENTRY_PGTT_FL_ONLY (1)
@@ -297,6 +300,27 @@ static inline void pasid_set_eafe(struct pasid_entry *pe)
pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
}
+static inline bool pasid_support_hitless_replace(struct pasid_entry *pte,
+ struct pasid_entry *new, int type)
+{
+ switch (type) {
+ case PASID_ENTRY_PGTT_FL_ONLY:
+ case PASID_ENTRY_PGTT_NESTED:
+ /* The first 128 bits remain the same. */
+ return READ_ONCE(pte->val[0]) == READ_ONCE(new->val[0]) &&
+ READ_ONCE(pte->val[1]) == READ_ONCE(new->val[1]);
+ case PASID_ENTRY_PGTT_SL_ONLY:
+ case PASID_ENTRY_PGTT_PT:
+ /* The second 128 bits remain the same. */
+ return READ_ONCE(pte->val[2]) == READ_ONCE(new->val[2]) &&
+ READ_ONCE(pte->val[3]) == READ_ONCE(new->val[3]);
+ default:
+ WARN_ON(true);
+ }
+
+ return false;
+}
+
extern unsigned int intel_pasid_max_id;
int intel_pasid_alloc_table(struct device *dev);
void intel_pasid_free_table(struct device *dev);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 4f36138448d8..da7ab18d3bfe 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -452,7 +452,20 @@ int intel_pasid_replace_first_level(struct intel_iommu *iommu,
WARN_ON(old_did != pasid_get_domain_id(pte));
- *pte = new_pte;
+ if (!pasid_support_hitless_replace(pte, &new_pte,
+ PASID_ENTRY_PGTT_FL_ONLY)) {
+ spin_unlock(&iommu->lock);
+ intel_pasid_tear_down_entry(iommu, dev, pasid, false);
+
+ return intel_pasid_setup_first_level(iommu, dev, fsptptr,
+ pasid, did, flags);
+ }
+
+ /*
+ * A first-only hitless replace requires the first 128 bits to remain
+ * the same. Only the second 128-bit chunk needs to be updated.
+ */
+ intel_iommu_atomic128_set(&pte->val128[1], new_pte.val128[1]);
spin_unlock(&iommu->lock);
intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
@@ -563,7 +576,19 @@ int intel_pasid_replace_second_level(struct intel_iommu *iommu,
WARN_ON(old_did != pasid_get_domain_id(pte));
- *pte = new_pte;
+ if (!pasid_support_hitless_replace(pte, &new_pte,
+ PASID_ENTRY_PGTT_SL_ONLY)) {
+ spin_unlock(&iommu->lock);
+ intel_pasid_tear_down_entry(iommu, dev, pasid, false);
+
+ return intel_pasid_setup_second_level(iommu, domain, dev, pasid);
+ }
+
+ /*
+ * A second-only hitless replace requires the second 128 bits to remain
+ * the same. Only the first 128-bit chunk needs to be updated.
+ */
+ intel_iommu_atomic128_set(&pte->val128[0], new_pte.val128[0]);
spin_unlock(&iommu->lock);
intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
@@ -707,7 +732,19 @@ int intel_pasid_replace_pass_through(struct intel_iommu *iommu,
WARN_ON(old_did != pasid_get_domain_id(pte));
- *pte = new_pte;
+ if (!pasid_support_hitless_replace(pte, &new_pte,
+ PASID_ENTRY_PGTT_PT)) {
+ spin_unlock(&iommu->lock);
+ intel_pasid_tear_down_entry(iommu, dev, pasid, false);
+
+ return intel_pasid_setup_pass_through(iommu, dev, pasid);
+ }
+
+ /*
+ * A passthrough hitless replace requires the second 128 bits to remain
+ * the same. Only the first 128-bit chunk needs to be updated.
+ */
+ intel_iommu_atomic128_set(&pte->val128[0], new_pte.val128[0]);
spin_unlock(&iommu->lock);
intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
@@ -903,7 +940,19 @@ int intel_pasid_replace_nested(struct intel_iommu *iommu,
WARN_ON(old_did != pasid_get_domain_id(pte));
- *pte = new_pte;
+ if (!pasid_support_hitless_replace(pte, &new_pte,
+ PASID_ENTRY_PGTT_NESTED)) {
+ spin_unlock(&iommu->lock);
+ intel_pasid_tear_down_entry(iommu, dev, pasid, false);
+
+ return intel_pasid_setup_nested(iommu, dev, pasid, domain);
+ }
+
+ /*
+ * A nested hitless replace requires the first 128 bits to remain
+ * the same. Only the second 128-bit chunk needs to be updated.
+ */
+ intel_iommu_atomic128_set(&pte->val128[1], new_pte.val128[1]);
spin_unlock(&iommu->lock);
intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
--
2.43.0
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 3:00 ` [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement Lu Baolu
@ 2026-01-13 15:05 ` Jason Gunthorpe
2026-01-14 6:03 ` Baolu Lu
2026-01-13 19:27 ` Samiullah Khawaja
2026-01-13 19:39 ` Dmytro Maluka
2 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-13 15:05 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel
On Tue, Jan 13, 2026 at 11:00:48AM +0800, Lu Baolu wrote:
> +static inline bool pasid_support_hitless_replace(struct pasid_entry *pte,
> + struct pasid_entry *new, int type)
> +{
> + switch (type) {
> + case PASID_ENTRY_PGTT_FL_ONLY:
> + case PASID_ENTRY_PGTT_NESTED:
> + /* The first 128 bits remain the same. */
> + return READ_ONCE(pte->val[0]) == READ_ONCE(new->val[0]) &&
> + READ_ONCE(pte->val[1]) == READ_ONCE(new->val[1]);
pte->val128[0] == new->val128[0]
> + case PASID_ENTRY_PGTT_SL_ONLY:
> + case PASID_ENTRY_PGTT_PT:
> + /* The second 128 bits remain the same. */
> + return READ_ONCE(pte->val[2]) == READ_ONCE(new->val[2]) &&
> + READ_ONCE(pte->val[3]) == READ_ONCE(new->val[3]);
These READ_ONCE's are pointless, especially the ones on new.
With 5 words to worry about I really feel strongly this should just
use the ARM algorithm. It handles everything very elegantly, we can
lift it out of ARM and make it general.
Here, I did a quick refactoring into general code:
https://github.com/jgunthorpe/linux/commits/for-baolu/
You just need to provide a "used" function to compute which bits HW is
not ignoring and a sync function to push the invalidation command. It
will take care of all sequencing needs for all possible new/old
combinations.
Then delete the replace/not replace split in the code too.
Jason
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 3:00 ` [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement Lu Baolu
2026-01-13 15:05 ` Jason Gunthorpe
@ 2026-01-13 19:27 ` Samiullah Khawaja
2026-01-13 20:56 ` Jason Gunthorpe
2026-01-14 5:45 ` Baolu Lu
2026-01-13 19:39 ` Dmytro Maluka
2 siblings, 2 replies; 39+ messages in thread
From: Samiullah Khawaja @ 2026-01-13 19:27 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel
On Mon, Jan 12, 2026 at 7:03 PM Lu Baolu <baolu.lu@linux.intel.com> wrote:
>
> The Intel VT-d PASID table entry is 512 bits (64 bytes). Because the
> hardware may fetch this entry in multiple 128-bit chunks, updating the
> entire entry while it is active (P=1) risks a "torn" read where the
> hardware observes an inconsistent state.
>
> However, certain updates (e.g., changing page table pointers while
> keeping the translation type and domain ID the same) can be performed
> hitlessly. This is possible if the update is limited to a single
> 128-bit chunk while the other chunks remain stable.
>
> Introduce a hitless replacement mechanism for PASID entries:
>
> - Update 'struct pasid_entry' with a union to support 128-bit
> access via the newly added val128[4] array.
> - Add pasid_support_hitless_replace() to determine if a transition
> between an old and new entry is safe to perform atomically.
> - For First-level/Nested translations: The first 128 bits (chunk 0)
> must remain identical; chunk 1 is updated atomically.
Looking at the specs, the DID is part of the first 128 bits (chunk 0),
so I guess for the first level the hitless replacement would not be
supported since each domain will have a different DID?
> - For Second-level/Pass-through: The second 128 bits (chunk 1)
> must remain identical; chunk 0 is updated atomically.
> - If hitless replacement is supported, use intel_iommu_atomic128_set()
> to commit the change in a single 16-byte burst.
> - If the changes are too extensive to be hitless, fall back to the
> safe "tear down and re-setup" flow (clear present -> flush -> setup).
>
> Fixes: 7543ee63e811 ("iommu/vt-d: Add pasid replace helpers")
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> drivers/iommu/intel/pasid.h | 26 ++++++++++++++++-
> drivers/iommu/intel/pasid.c | 57 ++++++++++++++++++++++++++++++++++---
> 2 files changed, 78 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index 35de1d77355f..b569e2828a8b 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -37,7 +37,10 @@ struct pasid_dir_entry {
> };
>
> struct pasid_entry {
> - u64 val[8];
> + union {
> + u64 val[8];
> + u128 val128[4];
> + };
> };
>
> #define PASID_ENTRY_PGTT_FL_ONLY (1)
> @@ -297,6 +300,27 @@ static inline void pasid_set_eafe(struct pasid_entry *pe)
> pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> }
>
> +static inline bool pasid_support_hitless_replace(struct pasid_entry *pte,
> + struct pasid_entry *new, int type)
> +{
> + switch (type) {
> + case PASID_ENTRY_PGTT_FL_ONLY:
> + case PASID_ENTRY_PGTT_NESTED:
> + /* The first 128 bits remain the same. */
> + return READ_ONCE(pte->val[0]) == READ_ONCE(new->val[0]) &&
> + READ_ONCE(pte->val[1]) == READ_ONCE(new->val[1]);
> + case PASID_ENTRY_PGTT_SL_ONLY:
> + case PASID_ENTRY_PGTT_PT:
> + /* The second 128 bits remain the same. */
> + return READ_ONCE(pte->val[2]) == READ_ONCE(new->val[2]) &&
> + READ_ONCE(pte->val[3]) == READ_ONCE(new->val[3]);
> + default:
> + WARN_ON(true);
> + }
> +
> + return false;
> +}
> +
> extern unsigned int intel_pasid_max_id;
> int intel_pasid_alloc_table(struct device *dev);
> void intel_pasid_free_table(struct device *dev);
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 4f36138448d8..da7ab18d3bfe 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -452,7 +452,20 @@ int intel_pasid_replace_first_level(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_FL_ONLY)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_first_level(iommu, dev, fsptptr,
> + pasid, did, flags);
> + }
> +
> + /*
> + * A first-only hitless replace requires the first 128 bits to remain
> + * the same. Only the second 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[1], new_pte.val128[1]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> @@ -563,7 +576,19 @@ int intel_pasid_replace_second_level(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_SL_ONLY)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_second_level(iommu, domain, dev, pasid);
> + }
> +
> + /*
> + * A second-only hitless replace requires the second 128 bits to remain
> + * the same. Only the first 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[0], new_pte.val128[0]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> @@ -707,7 +732,19 @@ int intel_pasid_replace_pass_through(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_PT)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_pass_through(iommu, dev, pasid);
> + }
> +
> + /*
> + * A passthrough hitless replace requires the second 128 bits to remain
> + * the same. Only the first 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[0], new_pte.val128[0]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> @@ -903,7 +940,19 @@ int intel_pasid_replace_nested(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_NESTED)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_nested(iommu, dev, pasid, domain);
> + }
> +
> + /*
> + * A nested hitless replace requires the first 128 bits to remain
> + * the same. Only the second 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[1], new_pte.val128[1]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> --
> 2.43.0
>
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-13 3:00 ` [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries Lu Baolu
@ 2026-01-13 19:27 ` Dmytro Maluka
2026-01-14 5:14 ` Baolu Lu
2026-01-14 7:54 ` Tian, Kevin
1 sibling, 1 reply; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-13 19:27 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Tue, Jan 13, 2026 at 11:00:46AM +0800, Lu Baolu wrote:
> On Intel IOMMU, device context entries are accessed by hardware in
> 128-bit chunks. Currently, the driver updates these entries by
> programming the 'lo' and 'hi' 64-bit fields individually.
>
> This creates a potential race condition where the IOMMU hardware may fetch
> a context entry while the CPU has only completed one of the two 64-bit
> writes. This "torn" entry — consisting of half-old and half-new data —
> could lead to unpredictable hardware behavior, especially when
> transitioning the 'Present' bit or changing translation types.
>
> To ensure the IOMMU hardware always observes a consistent state, use
> 128-bit atomic updates for context entries. This is achieved by building
> context entries on the stack and writing them to the table in a single
> operation.
>
> As this relies on arch_cmpxchg128_local(), restrict INTEL_IOMMU
> dependencies to X86_64.
>
> Fixes: ba39592764ed2 ("Intel IOMMU: Intel IOMMU driver")
> Reported-by: Dmytro Maluka <dmaluka@chromium.org>
FWIW, Jason and Kevin contributed to this discovery more than I did. :)
> Closes: https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> drivers/iommu/intel/Kconfig | 2 +-
> drivers/iommu/intel/iommu.h | 22 ++++++++++++++++++----
> drivers/iommu/intel/iommu.c | 30 +++++++++++++++---------------
> drivers/iommu/intel/pasid.c | 18 +++++++++---------
> 4 files changed, 43 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
> index 5471f814e073..efda19820f95 100644
> --- a/drivers/iommu/intel/Kconfig
> +++ b/drivers/iommu/intel/Kconfig
> @@ -11,7 +11,7 @@ config DMAR_DEBUG
>
> config INTEL_IOMMU
> bool "Support for Intel IOMMU using DMA Remapping Devices"
> - depends on PCI_MSI && ACPI && X86
> + depends on PCI_MSI && ACPI && X86_64
> select IOMMU_API
> select GENERIC_PT
> select IOMMU_PT
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 25c5e22096d4..b8999802f401 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -546,6 +546,16 @@ struct pasid_entry;
> struct pasid_state_entry;
> struct page_req_dsc;
>
> +static __always_inline void intel_iommu_atomic128_set(u128 *ptr, u128 val)
> +{
> + /*
> + * Use the cmpxchg16b instruction for 128-bit atomicity. As updates
> + * are serialized by a spinlock, we use the local (unlocked) variant
> + * to avoid unnecessary bus locking overhead.
> + */
> + arch_cmpxchg128_local(ptr, *ptr, val);
Any reason why not cmpxchg128_local()? (except following the AMD driver)
Otherwise,
Reviewed-by: Dmytro Maluka <dmaluka@chromium.org>
> +}
> +
> /*
> * 0: Present
> * 1-11: Reserved
> @@ -569,8 +579,13 @@ struct root_entry {
> * 8-23: domain id
> */
> struct context_entry {
> - u64 lo;
> - u64 hi;
> + union {
> + struct {
> + u64 lo;
> + u64 hi;
> + };
> + u128 val128;
> + };
> };
>
> struct iommu_domain_info {
> @@ -946,8 +961,7 @@ static inline int context_domain_id(struct context_entry *c)
>
> static inline void context_clear_entry(struct context_entry *context)
> {
> - context->lo = 0;
> - context->hi = 0;
> + intel_iommu_atomic128_set(&context->val128, 0);
> }
>
> #ifdef CONFIG_INTEL_IOMMU
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 134302fbcd92..d721061ebda2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1147,8 +1147,8 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
> domain_lookup_dev_info(domain, iommu, bus, devfn);
> u16 did = domain_id_iommu(domain, iommu);
> int translation = CONTEXT_TT_MULTI_LEVEL;
> + struct context_entry *context, new = {0};
> struct pt_iommu_vtdss_hw_info pt_info;
> - struct context_entry *context;
> int ret;
>
> if (WARN_ON(!intel_domain_is_ss_paging(domain)))
> @@ -1170,19 +1170,19 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
> goto out_unlock;
>
> copied_context_tear_down(iommu, context, bus, devfn);
> - context_clear_entry(context);
> - context_set_domain_id(context, did);
> + context_set_domain_id(&new, did);
>
> if (info && info->ats_supported)
> translation = CONTEXT_TT_DEV_IOTLB;
> else
> translation = CONTEXT_TT_MULTI_LEVEL;
>
> - context_set_address_root(context, pt_info.ssptptr);
> - context_set_address_width(context, pt_info.aw);
> - context_set_translation_type(context, translation);
> - context_set_fault_enable(context);
> - context_set_present(context);
> + context_set_address_root(&new, pt_info.ssptptr);
> + context_set_address_width(&new, pt_info.aw);
> + context_set_translation_type(&new, translation);
> + context_set_fault_enable(&new);
> + context_set_present(&new);
> + intel_iommu_atomic128_set(&context->val128, new.val128);
> if (!ecap_coherent(iommu->ecap))
> clflush_cache_range(context, sizeof(*context));
> context_present_cache_flush(iommu, did, bus, devfn);
> @@ -3771,8 +3771,8 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
> static int context_setup_pass_through(struct device *dev, u8 bus, u8 devfn)
> {
> struct device_domain_info *info = dev_iommu_priv_get(dev);
> + struct context_entry *context, new = {0};
> struct intel_iommu *iommu = info->iommu;
> - struct context_entry *context;
>
> spin_lock(&iommu->lock);
> context = iommu_context_addr(iommu, bus, devfn, 1);
> @@ -3787,17 +3787,17 @@ static int context_setup_pass_through(struct device *dev, u8 bus, u8 devfn)
> }
>
> copied_context_tear_down(iommu, context, bus, devfn);
> - context_clear_entry(context);
> - context_set_domain_id(context, FLPT_DEFAULT_DID);
> + context_set_domain_id(&new, FLPT_DEFAULT_DID);
>
> /*
> * In pass through mode, AW must be programmed to indicate the largest
> * AGAW value supported by hardware. And ASR is ignored by hardware.
> */
> - context_set_address_width(context, iommu->msagaw);
> - context_set_translation_type(context, CONTEXT_TT_PASS_THROUGH);
> - context_set_fault_enable(context);
> - context_set_present(context);
> + context_set_address_width(&new, iommu->msagaw);
> + context_set_translation_type(&new, CONTEXT_TT_PASS_THROUGH);
> + context_set_fault_enable(&new);
> + context_set_present(&new);
> + intel_iommu_atomic128_set(&context->val128, new.val128);
> if (!ecap_coherent(iommu->ecap))
> clflush_cache_range(context, sizeof(*context));
> context_present_cache_flush(iommu, FLPT_DEFAULT_DID, bus, devfn);
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 3e2255057079..298a39183996 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -978,23 +978,23 @@ static int context_entry_set_pasid_table(struct context_entry *context,
> struct device_domain_info *info = dev_iommu_priv_get(dev);
> struct pasid_table *table = info->pasid_table;
> struct intel_iommu *iommu = info->iommu;
> + struct context_entry new = {0};
> unsigned long pds;
>
> - context_clear_entry(context);
> -
> pds = context_get_sm_pds(table);
> - context->lo = (u64)virt_to_phys(table->table) | context_pdts(pds);
> - context_set_sm_rid2pasid(context, IOMMU_NO_PASID);
> + new.lo = (u64)virt_to_phys(table->table) | context_pdts(pds);
> + context_set_sm_rid2pasid(&new, IOMMU_NO_PASID);
>
> if (info->ats_supported)
> - context_set_sm_dte(context);
> + context_set_sm_dte(&new);
> if (info->pasid_supported)
> - context_set_pasid(context);
> + context_set_pasid(&new);
> if (info->pri_supported)
> - context_set_sm_pre(context);
> + context_set_sm_pre(&new);
>
> - context_set_fault_enable(context);
> - context_set_present(context);
> + context_set_fault_enable(&new);
> + context_set_present(&new);
> + intel_iommu_atomic128_set(&context->val128, new.val128);
> __iommu_flush_cache(iommu, context, sizeof(*context));
>
> return 0;
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-13 3:00 ` [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry Lu Baolu
@ 2026-01-13 19:34 ` Dmytro Maluka
2026-01-14 5:38 ` Baolu Lu
2026-01-14 7:32 ` Tian, Kevin
1 sibling, 1 reply; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-13 19:34 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> bytes). When tearing down an entry, the current implementation zeros the
> entire 64-byte structure immediately.
>
> However, the IOMMU hardware may fetch these 64 bytes using multiple
> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> occurs simultaneously with the CPU zeroing the entry, the hardware could
> observe a "torn" entry — where some chunks are zeroed and others still
> contain old data — leading to unpredictable behavior or spurious faults.
>
> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> (Section 6.5.3.3) by implementing a proper ownership handshake:
>
> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
> hardware that the entry is no longer valid.
> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
> Device-TLB flush) to ensure the hardware has released all cached
> references to the entry.
> 3. Only after the flushes are complete, zero out the remaining fields of
> the PASID entry.
>
> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
> ensure that the cleared entry is visible to the IOMMU on systems where
> memory coherency (ecap_coherent) is not supported.
>
> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> drivers/iommu/intel/pasid.h | 12 ++++++++++++
> drivers/iommu/intel/pasid.c | 9 +++++++--
> 2 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index b4c85242dc79..35de1d77355f 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
> pasid_set_bits(&pe->val[0], 1 << 0, 1);
> }
>
> +/*
> + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
> + * This initiates the transition of the entry's ownership from hardware
> + * to software. The caller is responsible for fulfilling the invalidation
> + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
> + * Software for Invalidations).
> + */
> +static inline void pasid_clear_present(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[0], 1 << 0, 0);
> +}
> +
> /*
> * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
> * entry.
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 298a39183996..4f36138448d8 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -178,7 +178,8 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
> * Interfaces for PASID table entry manipulation:
> */
> static void
> -intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
> +intel_pasid_clear_entry(struct intel_iommu *iommu, struct device *dev,
> + u32 pasid, bool fault_ignore)
> {
> struct pasid_entry *pe;
>
> @@ -190,6 +191,9 @@ intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
> pasid_clear_entry_with_fpd(pe);
> else
> pasid_clear_entry(pe);
> +
> + if (!ecap_coherent(iommu->ecap))
> + clflush_cache_range(pe, sizeof(*pe));
> }
>
> static void
> @@ -272,7 +276,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>
> did = pasid_get_domain_id(pte);
> pgtt = pasid_pte_get_pgtt(pte);
> - intel_pasid_clear_entry(dev, pasid, fault_ignore);
> + pasid_clear_present(pte);
> spin_unlock(&iommu->lock);
>
> if (!ecap_coherent(iommu->ecap))
> @@ -286,6 +290,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
> iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>
> devtlb_invalidation_with_pasid(iommu, dev, pasid);
> + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
Is it safe to do this with iommu->lock already unlocked?
> if (!fault_ignore)
> intel_iommu_drain_pasid_prq(dev, pasid);
> }
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 3:00 ` [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement Lu Baolu
2026-01-13 15:05 ` Jason Gunthorpe
2026-01-13 19:27 ` Samiullah Khawaja
@ 2026-01-13 19:39 ` Dmytro Maluka
2026-01-13 20:06 ` Dmytro Maluka
2 siblings, 1 reply; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-13 19:39 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Tue, Jan 13, 2026 at 11:00:48AM +0800, Lu Baolu wrote:
> The Intel VT-d PASID table entry is 512 bits (64 bytes). Because the
> hardware may fetch this entry in multiple 128-bit chunks, updating the
> entire entry while it is active (P=1) risks a "torn" read where the
> hardware observes an inconsistent state.
>
> However, certain updates (e.g., changing page table pointers while
> keeping the translation type and domain ID the same) can be performed
> hitlessly. This is possible if the update is limited to a single
> 128-bit chunk while the other chunks remain stable.
>
> Introduce a hitless replacement mechanism for PASID entries:
>
> - Update 'struct pasid_entry' with a union to support 128-bit
> access via the newly added val128[4] array.
> - Add pasid_support_hitless_replace() to determine if a transition
> between an old and new entry is safe to perform atomically.
> - For First-level/Nested translations: The first 128 bits (chunk 0)
> must remain identical; chunk 1 is updated atomically.
> - For Second-level/Pass-through: The second 128 bits (chunk 1)
> must remain identical; chunk 0 is updated atomically.
> - If hitless replacement is supported, use intel_iommu_atomic128_set()
> to commit the change in a single 16-byte burst.
> - If the changes are too extensive to be hitless, fall back to the
> safe "tear down and re-setup" flow (clear present -> flush -> setup).
>
> Fixes: 7543ee63e811 ("iommu/vt-d: Add pasid replace helpers")
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> drivers/iommu/intel/pasid.h | 26 ++++++++++++++++-
> drivers/iommu/intel/pasid.c | 57 ++++++++++++++++++++++++++++++++++---
> 2 files changed, 78 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index 35de1d77355f..b569e2828a8b 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -37,7 +37,10 @@ struct pasid_dir_entry {
> };
>
> struct pasid_entry {
> - u64 val[8];
> + union {
> + u64 val[8];
> + u128 val128[4];
> + };
> };
>
> #define PASID_ENTRY_PGTT_FL_ONLY (1)
> @@ -297,6 +300,27 @@ static inline void pasid_set_eafe(struct pasid_entry *pe)
> pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> }
>
> +static inline bool pasid_support_hitless_replace(struct pasid_entry *pte,
> + struct pasid_entry *new, int type)
> +{
> + switch (type) {
> + case PASID_ENTRY_PGTT_FL_ONLY:
> + case PASID_ENTRY_PGTT_NESTED:
> + /* The first 128 bits remain the same. */
> + return READ_ONCE(pte->val[0]) == READ_ONCE(new->val[0]) &&
> + READ_ONCE(pte->val[1]) == READ_ONCE(new->val[1]);
> + case PASID_ENTRY_PGTT_SL_ONLY:
> + case PASID_ENTRY_PGTT_PT:
> + /* The second 128 bits remain the same. */
> + return READ_ONCE(pte->val[2]) == READ_ONCE(new->val[2]) &&
> + READ_ONCE(pte->val[3]) == READ_ONCE(new->val[3]);
> + default:
> + WARN_ON(true);
nit: WARN_ON(false) seems a bit more suitable?
> + }
> +
> + return false;
> +}
> +
> extern unsigned int intel_pasid_max_id;
> int intel_pasid_alloc_table(struct device *dev);
> void intel_pasid_free_table(struct device *dev);
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 4f36138448d8..da7ab18d3bfe 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -452,7 +452,20 @@ int intel_pasid_replace_first_level(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_FL_ONLY)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_first_level(iommu, dev, fsptptr,
> + pasid, did, flags);
> + }
> +
> + /*
> + * A first-only hitless replace requires the first 128 bits to remain
> + * the same. Only the second 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[1], new_pte.val128[1]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> @@ -563,7 +576,19 @@ int intel_pasid_replace_second_level(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_SL_ONLY)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_second_level(iommu, domain, dev, pasid);
> + }
> +
> + /*
> + * A second-only hitless replace requires the second 128 bits to remain
> + * the same. Only the first 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[0], new_pte.val128[0]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> @@ -707,7 +732,19 @@ int intel_pasid_replace_pass_through(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_PT)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_pass_through(iommu, dev, pasid);
> + }
> +
> + /*
> + * A passthrough hitless replace requires the second 128 bits to remain
> + * the same. Only the first 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[0], new_pte.val128[0]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> @@ -903,7 +940,19 @@ int intel_pasid_replace_nested(struct intel_iommu *iommu,
>
> WARN_ON(old_did != pasid_get_domain_id(pte));
>
> - *pte = new_pte;
> + if (!pasid_support_hitless_replace(pte, &new_pte,
> + PASID_ENTRY_PGTT_NESTED)) {
> + spin_unlock(&iommu->lock);
> + intel_pasid_tear_down_entry(iommu, dev, pasid, false);
> +
> + return intel_pasid_setup_nested(iommu, dev, pasid, domain);
> + }
> +
> + /*
> + * A nested hitless replace requires the first 128 bits to remain
> + * the same. Only the second 128-bit chunk needs to be updated.
> + */
> + intel_iommu_atomic128_set(&pte->val128[1], new_pte.val128[1]);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 19:39 ` Dmytro Maluka
@ 2026-01-13 20:06 ` Dmytro Maluka
0 siblings, 0 replies; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-13 20:06 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Tue, Jan 13, 2026 at 08:40:00PM +0100, Dmytro Maluka wrote:
> On Tue, Jan 13, 2026 at 11:00:48AM +0800, Lu Baolu wrote:
...
> > + default:
> > + WARN_ON(true);
>
> nit: WARN_ON(false) seems a bit more suitable?
Ah sorry, ignore this. Somehow I thought this was returning true...
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 19:27 ` Samiullah Khawaja
@ 2026-01-13 20:56 ` Jason Gunthorpe
2026-01-14 5:45 ` Baolu Lu
1 sibling, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-13 20:56 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Dmytro Maluka, iommu, linux-kernel
On Tue, Jan 13, 2026 at 11:27:30AM -0800, Samiullah Khawaja wrote:
> > - Update 'struct pasid_entry' with a union to support 128-bit
> > access via the newly added val128[4] array.
> > - Add pasid_support_hitless_replace() to determine if a transition
> > between an old and new entry is safe to perform atomically.
> > - For First-level/Nested translations: The first 128 bits (chunk 0)
> > must remain identical; chunk 1 is updated atomically.
>
> Looking at the specs, the DID is part of the first 128 bits (chunk 0),
> so I guess for the first level the hitless replacement would not be
> supported since each domain will have a different DID?
Ah, yeah, you won't be able to do the KHO replace thing when using a
first stage..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-13 19:27 ` Dmytro Maluka
@ 2026-01-14 5:14 ` Baolu Lu
2026-01-14 10:55 ` Dmytro Maluka
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-14 5:14 UTC (permalink / raw)
To: Dmytro Maluka
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On 1/14/26 03:27, Dmytro Maluka wrote:
> On Tue, Jan 13, 2026 at 11:00:46AM +0800, Lu Baolu wrote:
>> On Intel IOMMU, device context entries are accessed by hardware in
>> 128-bit chunks. Currently, the driver updates these entries by
>> programming the 'lo' and 'hi' 64-bit fields individually.
>>
>> This creates a potential race condition where the IOMMU hardware may fetch
>> a context entry while the CPU has only completed one of the two 64-bit
>> writes. This "torn" entry — consisting of half-old and half-new data —
>> could lead to unpredictable hardware behavior, especially when
>> transitioning the 'Present' bit or changing translation types.
>>
>> To ensure the IOMMU hardware always observes a consistent state, use
>> 128-bit atomic updates for context entries. This is achieved by building
>> context entries on the stack and write them to the table in a single
>> operation.
>>
>> As this relies on arch_cmpxchg128_local(), restrict INTEL_IOMMU
>> dependencies to X86_64.
>>
>> Fixes: ba39592764ed2 ("Intel IOMMU: Intel IOMMU driver")
>> Reported-by: Dmytro Maluka<dmaluka@chromium.org>
> FWIW, Jason and Kevin contributed to this discovery more than I did. 🙂
Thanks to all you guys.
>
>> Closes:https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/
>> Signed-off-by: Lu Baolu<baolu.lu@linux.intel.com>
>> ---
>> drivers/iommu/intel/Kconfig | 2 +-
>> drivers/iommu/intel/iommu.h | 22 ++++++++++++++++++----
>> drivers/iommu/intel/iommu.c | 30 +++++++++++++++---------------
>> drivers/iommu/intel/pasid.c | 18 +++++++++---------
>> 4 files changed, 43 insertions(+), 29 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
>> index 5471f814e073..efda19820f95 100644
>> --- a/drivers/iommu/intel/Kconfig
>> +++ b/drivers/iommu/intel/Kconfig
>> @@ -11,7 +11,7 @@ config DMAR_DEBUG
>>
>> config INTEL_IOMMU
>> bool "Support for Intel IOMMU using DMA Remapping Devices"
>> - depends on PCI_MSI && ACPI && X86
>> + depends on PCI_MSI && ACPI && X86_64
>> select IOMMU_API
>> select GENERIC_PT
>> select IOMMU_PT
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index 25c5e22096d4..b8999802f401 100644
>> --- a/drivers/iommu/intel/iommu.h
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -546,6 +546,16 @@ struct pasid_entry;
>> struct pasid_state_entry;
>> struct page_req_dsc;
>>
>> +static __always_inline void intel_iommu_atomic128_set(u128 *ptr, u128 val)
>> +{
>> + /*
>> + * Use the cmpxchg16b instruction for 128-bit atomicity. As updates
>> + * are serialized by a spinlock, we use the local (unlocked) variant
>> + * to avoid unnecessary bus locking overhead.
>> + */
>> + arch_cmpxchg128_local(ptr, *ptr, val);
> Any reason why not cmpxchg128_local()? (except following the AMD driver)
Yes. This follows the AMD IOMMU driver. Both drivers use spin lock to
synchronize the update of table entries. They only need the atomicity of
the 128-bit instruction itself. So arch_cmpxchg128_local() works.
>
> Otherwise,
>
> Reviewed-by: Dmytro Maluka<dmaluka@chromium.org>
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-13 19:34 ` Dmytro Maluka
@ 2026-01-14 5:38 ` Baolu Lu
2026-01-14 11:12 ` Dmytro Maluka
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-14 5:38 UTC (permalink / raw)
To: Dmytro Maluka
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On 1/14/26 03:34, Dmytro Maluka wrote:
> On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
>> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
>> bytes). When tearing down an entry, the current implementation zeros the
>> entire 64-byte structure immediately.
>>
>> However, the IOMMU hardware may fetch these 64 bytes using multiple
>> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
>> occurs simultaneously with the CPU zeroing the entry, the hardware could
>> observe a "torn" entry — where some chunks are zeroed and others still
>> contain old data — leading to unpredictable behavior or spurious faults.
>>
>> Follow the "Guidance to Software for Invalidations" in the VT-d spec
>> (Section 6.5.3.3) by implementing a proper ownership handshake:
>>
>> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>> hardware that the entry is no longer valid.
>> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>> Device-TLB flush) to ensure the hardware has released all cached
>> references to the entry.
>> 3. Only after the flushes are complete, zero out the remaining fields of
>> the PASID entry.
>>
>> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
>> ensure that the cleared entry is visible to the IOMMU on systems where
>> memory coherency (ecap_coherent) is not supported.
>>
>> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
>> Signed-off-by: Lu Baolu<baolu.lu@linux.intel.com>
>> ---
>> drivers/iommu/intel/pasid.h | 12 ++++++++++++
>> drivers/iommu/intel/pasid.c | 9 +++++++--
>> 2 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
>> index b4c85242dc79..35de1d77355f 100644
>> --- a/drivers/iommu/intel/pasid.h
>> +++ b/drivers/iommu/intel/pasid.h
>> @@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
>> pasid_set_bits(&pe->val[0], 1 << 0, 1);
>> }
>>
>> +/*
>> + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
>> + * This initiates the transition of the entry's ownership from hardware
>> + * to software. The caller is responsible for fulfilling the invalidation
>> + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
>> + * Software for Invalidations).
>> + */
>> +static inline void pasid_clear_present(struct pasid_entry *pe)
>> +{
>> + pasid_set_bits(&pe->val[0], 1 << 0, 0);
>> +}
>> +
>> /*
>> * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
>> * entry.
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 298a39183996..4f36138448d8 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -178,7 +178,8 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
>> * Interfaces for PASID table entry manipulation:
>> */
>> static void
>> -intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
>> +intel_pasid_clear_entry(struct intel_iommu *iommu, struct device *dev,
>> + u32 pasid, bool fault_ignore)
>> {
>> struct pasid_entry *pe;
>>
>> @@ -190,6 +191,9 @@ intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
>> pasid_clear_entry_with_fpd(pe);
>> else
>> pasid_clear_entry(pe);
>> +
>> + if (!ecap_coherent(iommu->ecap))
>> + clflush_cache_range(pe, sizeof(*pe));
>> }
>>
>> static void
>> @@ -272,7 +276,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>>
>> did = pasid_get_domain_id(pte);
>> pgtt = pasid_pte_get_pgtt(pte);
>> - intel_pasid_clear_entry(dev, pasid, fault_ignore);
>> + pasid_clear_present(pte);
>> spin_unlock(&iommu->lock);
>>
>> if (!ecap_coherent(iommu->ecap))
>> @@ -286,6 +290,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>> iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>>
>> devtlb_invalidation_with_pasid(iommu, dev, pasid);
>> + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> Is it safe to do this with iommu->lock already unlocked?
Yes, it is. The PASID entry lifecycle is serialized by the
iommu_group->mutex in the iommu core, which ensures that no other thread
can attempt
to allocate or setup this same PASID until intel_pasid_tear_down_entry()
has returned.
The iommu->lock is held during the initial transition (P->0) to ensure
atomicity against other hardware-table walkers, but once the P bit is
cleared and the caches are flushed, the final zeroing of the 'dead'
entry does not strictly require the spinlock because the PASID remains
reserved in software until the function completes.
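The three-step handshake can be modeled in a few lines of userspace C. Everything here is illustrative: `flush_caches()` is a hypothetical stand-in for the real PASID-cache/IOTLB/Device-TLB sequence, and it merely snapshots the entry so the ordering (Present cleared before the flush, full zeroing only after) can be checked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified 64-byte scalable-mode PASID entry: eight 64-bit words. */
struct pasid_entry {
	uint64_t val[8];
};

/* Hypothetical stand-ins for the driver's invalidation path; for the
 * sketch they only record what the entry looked like when the flush
 * ran. */
static struct pasid_entry *watched;
static int flush_ran;
static uint64_t val0_at_flush;

static void flush_caches(void)
{
	flush_ran = 1;
	val0_at_flush = watched->val[0];	/* snapshot mid-handshake */
}

/* The ownership handshake from the patch: clear Present first, run the
 * invalidation sequence, and only then zero the now software-owned
 * entry. */
static void teardown(struct pasid_entry *pe)
{
	pe->val[0] &= ~1ull;		/* 1. P=0: HW must stop using it  */
	flush_caches();			/* 2. PASID cache/IOTLB/dev-TLB   */
	memset(pe, 0, sizeof(*pe));	/* 3. entry dead; safe to zero    */
}
```

At the moment the flush runs, the snapshot shows Present already clear while the remaining fields are still intact, which is exactly the state the spec's handshake requires.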
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 19:27 ` Samiullah Khawaja
2026-01-13 20:56 ` Jason Gunthorpe
@ 2026-01-14 5:45 ` Baolu Lu
2026-01-14 7:26 ` Tian, Kevin
1 sibling, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-14 5:45 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel
On 1/14/26 03:27, Samiullah Khawaja wrote:
> On Mon, Jan 12, 2026 at 7:03 PM Lu Baolu<baolu.lu@linux.intel.com> wrote:
>> The Intel VT-d PASID table entry is 512 bits (64 bytes). Because the
>> hardware may fetch this entry in multiple 128-bit chunks, updating the
>> entire entry while it is active (P=1) risks a "torn" read where the
>> hardware observes an inconsistent state.
>>
>> However, certain updates (e.g., changing page table pointers while
>> keeping the translation type and domain ID the same) can be performed
>> hitlessly. This is possible if the update is limited to a single
>> 128-bit chunk while the other chunks remain stable.
>>
>> Introduce a hitless replacement mechanism for PASID entries:
>>
>> - Update 'struct pasid_entry' with a union to support 128-bit
>> access via the newly added val128[4] array.
>> - Add pasid_support_hitless_replace() to determine if a transition
>> between an old and new entry is safe to perform atomically.
>> - For First-level/Nested translations: The first 128 bits (chunk 0)
>> must remain identical; chunk 1 is updated atomically.
> Looking at the specs, the DID is part of the first 128 bits (chunk 0),
> so I guess for the first level the hitless replacement would not be
> supported since each domain will have a different DID?
It's not necessarily true that each domain will have a different DID. On
Intel IOMMU, all SVA domains can share a single DID. Similarly, multiple
nested domains sitting on top of the same second-stage page table can
also share a DID.
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-13 15:05 ` Jason Gunthorpe
@ 2026-01-14 6:03 ` Baolu Lu
0 siblings, 0 replies; 39+ messages in thread
From: Baolu Lu @ 2026-01-14 6:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel
On 1/13/26 23:05, Jason Gunthorpe wrote:
> On Tue, Jan 13, 2026 at 11:00:48AM +0800, Lu Baolu wrote:
>> +static inline bool pasid_support_hitless_replace(struct pasid_entry *pte,
>> + struct pasid_entry *new, int type)
>> +{
>> + switch (type) {
>> + case PASID_ENTRY_PGTT_FL_ONLY:
>> + case PASID_ENTRY_PGTT_NESTED:
>> + /* The first 128 bits remain the same. */
>> + return READ_ONCE(pte->val[0]) == READ_ONCE(new->val[0]) &&
>> + READ_ONCE(pte->val[1]) == READ_ONCE(new->val[1]);
>
> pte->val128[0] == new->val128[0]
>
>> + case PASID_ENTRY_PGTT_SL_ONLY:
>> + case PASID_ENTRY_PGTT_PT:
>> + /* The second 128 bits remain the same. */
>> + return READ_ONCE(pte->val[2]) == READ_ONCE(new->val[2]) &&
>> + READ_ONCE(pte->val[3]) == READ_ONCE(new->val[3]);
>
> These READ_ONCE's are pointless, especially the ones on new.
>
> With 5 words to worry about I really feel strongly this should just
> use the ARM algorithm. It handles everything very elegantly, we can
> lift it out of ARM and make it general.
>
> Here, I did a quick refactoring into general code:
>
> https://github.com/jgunthorpe/linux/commits/for-baolu/
>
> You just need to provide a used function to compute which bits HW is
> not ignoring and a sync function to push the invalidation command. It
> will take care of all sequencing needs for all possible new/old
> combinations.
It's always good to generate common library code to avoid boilerplate
code and different behaviors across multiple drivers. I'll try to
integrate this into the next version. Thanks for the help!
>
> Then delete the replace/not replace split in the code too.
>
> Jason
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-14 5:45 ` Baolu Lu
@ 2026-01-14 7:26 ` Tian, Kevin
2026-01-14 13:17 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2026-01-14 7:26 UTC (permalink / raw)
To: Baolu Lu, Samiullah Khawaja
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Wednesday, January 14, 2026 1:45 PM
>
> On 1/14/26 03:27, Samiullah Khawaja wrote:
> > On Mon, Jan 12, 2026 at 7:03 PM Lu Baolu<baolu.lu@linux.intel.com>
> wrote:
> >> The Intel VT-d PASID table entry is 512 bits (64 bytes). Because the
> >> hardware may fetch this entry in multiple 128-bit chunks, updating the
> >> entire entry while it is active (P=1) risks a "torn" read where the
> >> hardware observes an inconsistent state.
> >>
> >> However, certain updates (e.g., changing page table pointers while
> >> keeping the translation type and domain ID the same) can be performed
> >> hitlessly. This is possible if the update is limited to a single
> >> 128-bit chunk while the other chunks remain stable.
> >>
> >> Introduce a hitless replacement mechanism for PASID entries:
> >>
> >> - Update 'struct pasid_entry' with a union to support 128-bit
> >> access via the newly added val128[4] array.
> >> - Add pasid_support_hitless_replace() to determine if a transition
> >> between an old and new entry is safe to perform atomically.
> >> - For First-level/Nested translations: The first 128 bits (chunk 0)
> >> must remain identical; chunk 1 is updated atomically.
> > Looking at the specs, the DID is part of the first 128 bits (chunk 0),
> > so I guess for the first level the hitless replacement would not be
> > supported since each domain will have a different DID?
>
> It's not necessarily true that each domain will have a different DID. On
> Intel IOMMU, all SVA domains can share a single DID. Similarly, multiple
> nested domains sitting on top of the same second-stage page table can
> also share a DID.
>
I guess Samiullah was talking about a DMA domain with first stage, where
each DMA domain has its own DID. The spec says that the DID and the page
table pointer must be updated in one atomic operation. That applies to
the second stage but not to the first stage, whose pointer sits in a
different chunk from the DID.
But thinking about it more, I wonder whether that guidance is too strict.
The key requirement is below:
When modifying fields in present (P=1) entries, software must ensure
that at any point of time during the modification (performed through
single or multiple write operations), the before and after state of the
entry being modified is individually self-consistent.
i.e., no IOMMU error should be triggered when the hardware reads a
partially-modified entry during that transition period; it translates
either via the old table or via the new table.
Then whoever initiates the replace must ensure that in-flight DMAs only
target addresses with the same mapping in both the old and new tables.
Otherwise that's its own problem.
Now let's say a flow in the iommu driver:
1) updates the first stage page pointer (in the 2nd 128bit)
2) updates the DID (in the 1st 128bit)
3) flush iommu cache
before cache is flushed, it may contain:
- entries tagged with old DID, with content loaded from old table
- entries tagged with old DID, with content loaded from new table
- entries tagged with new DID, with content loaded from new table
Compared to the second stage, the only problematic case is old DID + new table.
According to 6.2.1 (Tagging of Cached Translations), the root address
of the page table is not used in tagging, and DID-based invalidation will
flush all entries tagged with the old DID (no matter whether they came
from the old or the new table).
Then it should just work!
p.s. Jason said that the atomic size is 128 bits on AMD and 64 bits on
ARM. They both have a DID concept and two page table pointers, so I
assume it's the same situation on this front?
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-13 3:00 ` [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry Lu Baolu
2026-01-13 19:34 ` Dmytro Maluka
@ 2026-01-14 7:32 ` Tian, Kevin
2026-01-14 8:27 ` Baolu Lu
1 sibling, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2026-01-14 7:32 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Tuesday, January 13, 2026 11:01 AM
>
> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> bytes). When tearing down an entry, the current implementation zeros the
> entire 64-byte structure immediately.
>
> However, the IOMMU hardware may fetch these 64 bytes using multiple
> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> occurs simultaneously with the CPU zeroing the entry, the hardware could
> observe a "torn" entry — where some chunks are zeroed and others still
> contain old data — leading to unpredictable behavior or spurious faults.
>
> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> (Section 6.5.3.3) by implementing a proper ownership handshake:
>
> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
> hardware that the entry is no longer valid.
> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
> Device-TLB flush) to ensure the hardware has released all cached
> references to the entry.
> 3. Only after the flushes are complete, zero out the remaining fields of
> the PASID entry.
>
> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
> ensure that the cleared entry is visible to the IOMMU on systems where
> memory coherency (ecap_coherent) is not supported.
>
better split the clflush part into a separate patch
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-13 3:00 ` [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries Lu Baolu
2026-01-13 19:27 ` Dmytro Maluka
@ 2026-01-14 7:54 ` Tian, Kevin
2026-01-15 3:26 ` Baolu Lu
1 sibling, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2026-01-14 7:54 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Tuesday, January 13, 2026 11:01 AM
>
> On Intel IOMMU, device context entries are accessed by hardware in
> 128-bit chunks. Currently, the driver updates these entries by
> programming the 'lo' and 'hi' 64-bit fields individually.
>
> This creates a potential race condition where the IOMMU hardware may
> fetch
> a context entry while the CPU has only completed one of the two 64-bit
> writes. This "torn" entry — consisting of half-old and half-new data —
> could lead to unpredictable hardware behavior, especially when
> transitioning the 'Present' bit or changing translation types.
This is not accurate. The context entry is only 128 bits. The scalable
context entry is 256 bits, but only the lower 128 bits are defined, so
hardware always fetches the context entry atomically. Then, if software
ensures the right order of updates (clear Present first, then the other
bits), the hardware won't look at a partial entry after seeing Present=0.
But, as Dmytro reported, there is currently no barrier in place, so the
two 64-bit updates to the context entry might be reordered and hardware
could fetch an entry with the old lower half (Present=1) and the new
upper half.
A 128-bit atomic operation avoids this ordering concern.
> @@ -1170,19 +1170,19 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
> goto out_unlock;
>
> copied_context_tear_down(iommu, context, bus, devfn);
> - context_clear_entry(context);
> - context_set_domain_id(context, did);
> + context_set_domain_id(&new, did);
I wonder whether it's necessary to use atomic in the attach path, from
fix p.o.v.
The assumption is that the context should have been cleared already
before calling this function (and following ones). Does it make more
sense to check the present bit, warning if set, then fail the operation?
We could refactor them to do atomic update, but then it's for cleanup
instead of being part of a fix.
Then this may be split into three patches:
- change context_clear_entry() to be atomic, to fix the teardown path
- add present bit check in other functions in this patch, to scrutinize the
attach path
- change those functions to be atomic, as a clean up
Does it make sense?
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-14 7:32 ` Tian, Kevin
@ 2026-01-14 8:27 ` Baolu Lu
2026-01-15 5:49 ` Tian, Kevin
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-14 8:27 UTC (permalink / raw)
To: Tian, Kevin, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On 1/14/26 15:32, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Tuesday, January 13, 2026 11:01 AM
>>
>> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
>> bytes). When tearing down an entry, the current implementation zeros the
>> entire 64-byte structure immediately.
>>
>> However, the IOMMU hardware may fetch these 64 bytes using multiple
>> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
>> occurs simultaneously with the CPU zeroing the entry, the hardware could
>> observe a "torn" entry — where some chunks are zeroed and others still
>> contain old data — leading to unpredictable behavior or spurious faults.
>>
>> Follow the "Guidance to Software for Invalidations" in the VT-d spec
>> (Section 6.5.3.3) by implementing a proper ownership handshake:
>>
>> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>> hardware that the entry is no longer valid.
>> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>> Device-TLB flush) to ensure the hardware has released all cached
>> references to the entry.
>> 3. Only after the flushes are complete, zero out the remaining fields of
>> the PASID entry.
>>
>> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
>> ensure that the cleared entry is visible to the IOMMU on systems where
>> memory coherency (ecap_coherent) is not supported.
>>
>
> better split the clflush part into a separate patch
Ah, the commit message is a bit confusing. Previously,
intel_pasid_clear_entry() was followed by a clflush. Now that we've
moved the clearing logic after the cache invalidation, an explicit
clflush is required. Instead of splitting this into a separate patch,
how about amending the commit message to make this clear?
Or, keep intel_pasid_clear_entry() untouched like the following:
iommu/vt-d: Clear Present bit before tearing down PASID entry
The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
bytes). When tearing down an entry, the current implementation zeros the
entire 64-byte structure immediately.
However, the IOMMU hardware may fetch these 64 bytes using multiple
internal transactions (e.g., four 128-bit bursts). If a hardware fetch
occurs simultaneously with the CPU zeroing the entry, the hardware could
observe a "torn" entry — where some chunks are zeroed and others still
contain old data — leading to unpredictable behavior or spurious faults.
Follow the "Guidance to Software for Invalidations" in the VT-d spec
(Section 6.5.3.3) by implementing a proper ownership handshake:
1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
hardware that the entry is no longer valid.
2. Execute the required invalidation sequence (PASID cache, IOTLB, and
Device-TLB flush) to ensure the hardware has released all cached
references to the entry.
3. Only after the flushes are complete, zero out the remaining fields of
the PASID entry.
Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
drivers/iommu/intel/pasid.h | 12 ++++++++++++
drivers/iommu/intel/pasid.c | 6 +++++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index b4c85242dc79..35de1d77355f 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
pasid_set_bits(&pe->val[0], 1 << 0, 1);
}
+/*
+ * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
+ * This initiates the transition of the entry's ownership from hardware
+ * to software. The caller is responsible for fulfilling the invalidation
+ * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
+ * Software for Invalidations).
+ */
+static inline void pasid_clear_present(struct pasid_entry *pe)
+{
+ pasid_set_bits(&pe->val[0], 1 << 0, 0);
+}
+
/*
* Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
* entry.
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 298a39183996..3a8ec5eba83e 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -272,7 +272,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
did = pasid_get_domain_id(pte);
pgtt = pasid_pte_get_pgtt(pte);
- intel_pasid_clear_entry(dev, pasid, fault_ignore);
+ pasid_clear_present(pte);
spin_unlock(&iommu->lock);
if (!ecap_coherent(iommu->ecap))
@@ -286,6 +286,10 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
devtlb_invalidation_with_pasid(iommu, dev, pasid);
+ intel_pasid_clear_entry(dev, pasid, fault_ignore);
+ if (!ecap_coherent(iommu->ecap))
+ clflush_cache_range(pte, sizeof(*pte));
+
if (!fault_ignore)
intel_iommu_drain_pasid_prq(dev, pasid);
}
--
2.43.0
Thanks,
baolu
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-14 5:14 ` Baolu Lu
@ 2026-01-14 10:55 ` Dmytro Maluka
2026-01-15 2:26 ` Baolu Lu
0 siblings, 1 reply; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-14 10:55 UTC (permalink / raw)
To: Baolu Lu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Wed, Jan 14, 2026 at 01:14:36PM +0800, Baolu Lu wrote:
> On 1/14/26 03:27, Dmytro Maluka wrote:
> > On Tue, Jan 13, 2026 at 11:00:46AM +0800, Lu Baolu wrote:
> > > +static __always_inline void intel_iommu_atomic128_set(u128 *ptr, u128 val)
> > > +{
> > > + /*
> > > + * Use the cmpxchg16b instruction for 128-bit atomicity. As updates
> > > + * are serialized by a spinlock, we use the local (unlocked) variant
> > > + * to avoid unnecessary bus locking overhead.
> > > + */
> > > + arch_cmpxchg128_local(ptr, *ptr, val);
> > Any reason why not cmpxchg128_local()? (except following the AMD driver)
>
> Yes. This follows the AMD IOMMU driver. Both drivers use spin lock to
> synchronize the update of table entries. They only need the atomicity of
> the 128-bit instruction itself. So arch_cmpxchg128_local() works.
Yeah, but my question was merely: why use the raw arch_*() version, not
cmpxchg128_local() which is the same but also includes optional
kasan/kcsan instrumentation:
#define cmpxchg128_local(ptr, ...) \
({ \
typeof(ptr) __ai_ptr = (ptr); \
instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
raw_cmpxchg128_local(__ai_ptr, __VA_ARGS__); \
})
IOW, why bypass this instrumentation?
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-14 5:38 ` Baolu Lu
@ 2026-01-14 11:12 ` Dmytro Maluka
2026-01-15 2:45 ` Baolu Lu
0 siblings, 1 reply; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-14 11:12 UTC (permalink / raw)
To: Baolu Lu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
> On 1/14/26 03:34, Dmytro Maluka wrote:
> > On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> > > + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> > Is it safe to do this with iommu->lock already unlocked?
>
> Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
> >mutex in the iommu core, which ensures that no other thread can attempt
> to allocate or setup this same PASID until intel_pasid_tear_down_entry()
> has returned.
>
> The iommu->lock is held during the initial transition (P->0) to ensure
> atomicity against other hardware-table walkers, but once the P bit is
> cleared and the caches are flushed, the final zeroing of the 'dead'
> entry does not strictly require the spinlock because the PASID remains
> reserved in software until the function completes.
Ok. Just to understand: "other hardware-table walkers" means some
software walkers, not hardware ones? Which software walkers are those?
(I can't imagine how holding a spinlock could prevent the hardware from
walking those tables. :))
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-14 7:26 ` Tian, Kevin
@ 2026-01-14 13:17 ` Jason Gunthorpe
2026-01-14 18:51 ` Samiullah Khawaja
2026-01-15 5:44 ` Tian, Kevin
0 siblings, 2 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-14 13:17 UTC (permalink / raw)
To: Tian, Kevin
Cc: Baolu Lu, Samiullah Khawaja, Joerg Roedel, Will Deacon,
Robin Murphy, Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On Wed, Jan 14, 2026 at 07:26:10AM +0000, Tian, Kevin wrote:
> before cache is flushed, it may contain:
>
> - entries tagged with old DID, with content loaded from old table
> - entries tagged with old DID, with content loaded from new table
> - entries tagged with new DID, with content loaded from new table
>
> Compared to 2nd-stage the only problematic one is old DID + new table.
>
> According to 6.2.1 (Tagging of Cached Translations), the root address
> of page table is not used in tagging and DID-based invalidation will
> flush all entries related to old DID (no matter it's from old or new table).
>
> Then it should just work!
Unless the original domain is attached to another device, then you've
corrupted the DID and corrupted IOTLB for the second innocent device
that isn't changing translation.
> p.s. Jason said that atomic size is 128bit on AMD and 64bit on ARM.
> they both have DID concept and two page table pointers. So I assume
> it's the same case on this front?
Hmm, yeah, ARM has it worse you can't change any ASID/VMID
concurrently with the table pointer.
You could make a safe algorithm by allocating a temporary ID, moving
the current entry to the temporary ID, moving to the new pointer,
moving to the final ID, then flushing the temporary ID.
It avoids the cross device issue and your logic above would hold.
Or maybe the case Samiullah is interested in should have the new
domain adopt the original ID..
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-14 13:17 ` Jason Gunthorpe
@ 2026-01-14 18:51 ` Samiullah Khawaja
2026-01-14 19:07 ` Jason Gunthorpe
2026-01-15 5:44 ` Tian, Kevin
1 sibling, 1 reply; 39+ messages in thread
From: Samiullah Khawaja @ 2026-01-14 18:51 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy,
Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On Wed, Jan 14, 2026 at 5:17 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Jan 14, 2026 at 07:26:10AM +0000, Tian, Kevin wrote:
> > before cache is flushed, it may contain:
> >
> > - entries tagged with old DID, with content loaded from old table
> > - entries tagged with old DID, with content loaded from new table
> > - entries tagged with new DID, with content loaded from new table
> >
> > Compared to 2nd-stage the only problematic one is old DID + new table.
> >
> > According to 6.2.1 (Tagging of Cached Translations), the root address
> > of page table is not used in tagging and DID-based invalidation will
> > flush all entries related to old DID (no matter it's from old or new table).
> >
> > Then it should just work!
>
> Unless the original domain is attached to another device, then you've
> corrupted the DID and corrupted IOTLB for the second innocent device
> that isn't changing translation.
>
> > p.s. Jason said that atomic size is 128bit on AMD and 64bit on ARM.
> > they both have DID concept and two page table pointers. So I assume
> > it's the same case on this front?
>
> Hmm, yeah, ARM has it worse you can't change any ASID/VMID
> concurrently with the table pointer.
>
> You could make a safe algorithm by allocating a temporary ID, moving
> the current entry to the temporary ID, moving to the new pointer,
> moving to the final ID, then flushing the tempoary ID.
Interesting idea. I think this can be done.
>
> It avoids the cross device issue and your logic above would hold.
>
> Or maybe the case Samiullah is interested in should have the new
> domain adopt the original ID..
There is no immediate use case for the first level; as I understand it,
we use the second level for all our VM use cases. But looking at the
code and specs, the first level is used for PASID-compatible domains
(using the IOMMU_HWPT_ALLOC_PASID flag), and it will not be replaceable
hitlessly.
>
> Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-14 18:51 ` Samiullah Khawaja
@ 2026-01-14 19:07 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-14 19:07 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy,
Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On Wed, Jan 14, 2026 at 10:51:02AM -0800, Samiullah Khawaja wrote:
> There is no immediate use case for first level, as I understand we use
> second level for all our VM use cases. But looking at the code and
> specs, the first level is used for pasid compatible domains (using
> IOMMU_HWPT_ALLOC_PASID flag) and it will not be replaceable
> hitlessly..
If the VMM allocates using the NEST_PARENT flag it will always get a
S2, and that is a reasonable thing for the VMM to do.
Jason
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-14 10:55 ` Dmytro Maluka
@ 2026-01-15 2:26 ` Baolu Lu
2026-01-15 13:12 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-15 2:26 UTC (permalink / raw)
To: Dmytro Maluka
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On 1/14/26 18:55, Dmytro Maluka wrote:
> On Wed, Jan 14, 2026 at 01:14:36PM +0800, Baolu Lu wrote:
>> On 1/14/26 03:27, Dmytro Maluka wrote:
>>> On Tue, Jan 13, 2026 at 11:00:46AM +0800, Lu Baolu wrote:
>>>> +static __always_inline void intel_iommu_atomic128_set(u128 *ptr, u128 val)
>>>> +{
>>>> + /*
>>>> + * Use the cmpxchg16b instruction for 128-bit atomicity. As updates
>>>> + * are serialized by a spinlock, we use the local (unlocked) variant
>>>> + * to avoid unnecessary bus locking overhead.
>>>> + */
>>>> + arch_cmpxchg128_local(ptr, *ptr, val);
>>> Any reason why not cmpxchg128_local()? (except following the AMD driver)
>>
>> Yes. This follows the AMD IOMMU driver. Both drivers use spin lock to
>> synchronize the update of table entries. They only need the atomicity of
>> the 128-bit instruction itself. So arch_cmpxchg128_local() works.
>
> Yeah, but my question was merely: why use the raw arch_*() version, not
> cmpxchg128_local() which is the same but also includes optional
> kasan/kcsan instrumentation:
>
> #define cmpxchg128_local(ptr, ...) \
> ({ \
> typeof(ptr) __ai_ptr = (ptr); \
> instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
> raw_cmpxchg128_local(__ai_ptr, __VA_ARGS__); \
> })
>
> IOW, why bypass this instrumentation?
You are right. There is no strong technical reason to bypass the kasan/
kcsan instrumentation here. My use of the arch_ version was primarily
following the existing pattern in the AMD driver, likely under the
assumption that the spinlock provided sufficient synchronization.
That said, Jason has suggested the generic entry_sync library to handle
these types of multi-quanta updates across different IOMMU drivers. I
plan to adopt that in the next version.
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-14 11:12 ` Dmytro Maluka
@ 2026-01-15 2:45 ` Baolu Lu
2026-01-15 21:35 ` Dmytro Maluka
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-15 2:45 UTC (permalink / raw)
To: Dmytro Maluka
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On 1/14/26 19:12, Dmytro Maluka wrote:
> On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
>> On 1/14/26 03:34, Dmytro Maluka wrote:
>>> On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
>>>> + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
>>> Is it safe to do this with iommu->lock already unlocked?
>>
>> Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
>>> mutex in the iommu core, which ensures that no other thread can attempt
>> to allocate or setup this same PASID until intel_pasid_tear_down_entry()
>> has returned.
>>
>> The iommu->lock is held during the initial transition (P->0) to ensure
>> atomicity against other hardware-table walkers, but once the P bit is
>> cleared and the caches are flushed, the final zeroing of the 'dead'
>> entry does not strictly require the spinlock because the PASID remains
>> reserved in software until the function completes.
>
> Ok. Just to understand: "other hardware-table walkers" means some
> software walkers, not hardware ones? Which software walkers are those?
> (I can't imagine how holding a spinlock could prevent the hardware from
> walking those tables. :))
You are right. A spinlock doesn't stop the hardware. The spinlock
serializes software threads to ensure the hardware walker always sees a
consistent entry.
When a PASID entry is active (P=1), other kernel paths might modify
the control bits in-place. For example:
void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
struct device *dev, u32 pasid)
{
struct pasid_entry *pte;
u16 did;
spin_lock(&iommu->lock);
pte = intel_pasid_get_entry(dev, pasid);
if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
spin_unlock(&iommu->lock);
return;
}
pasid_set_pgsnp(pte);
did = pasid_get_domain_id(pte);
spin_unlock(&iommu->lock);
intel_pasid_flush_present(iommu, dev, pasid, did, pte);
}
In this case, the iommu->lock ensures that if two threads try to modify
the same active entry, they don't interfere with each other and leave
the entry in a 'torn' state for the IOMMU hardware to read.
In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
(setting P=0 and flushing caches), the entry is owned exclusively by
the teardown thread until it is re-configured. That's the reason why the
final zeroing doesn't need the spinlock.
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-14 7:54 ` Tian, Kevin
@ 2026-01-15 3:26 ` Baolu Lu
2026-01-15 5:59 ` Tian, Kevin
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-15 3:26 UTC (permalink / raw)
To: Tian, Kevin, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On 1/14/26 15:54, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Tuesday, January 13, 2026 11:01 AM
>>
>> On Intel IOMMU, device context entries are accessed by hardware in
>> 128-bit chunks. Currently, the driver updates these entries by
>> programming the 'lo' and 'hi' 64-bit fields individually.
>>
>> This creates a potential race condition where the IOMMU hardware may
>> fetch
>> a context entry while the CPU has only completed one of the two 64-bit
>> writes. This "torn" entry — consisting of half-old and half-new data —
>> could lead to unpredictable hardware behavior, especially when
>> transitioning the 'Present' bit or changing translation types.
>
> this is not accurate. context entry is 128bits only. Scalable context
> entry is 256bits but only the lower 128bits are defined. so hardware always
> fetches the context entry atomically. Then if software ensures the right
> order of updates (clear present first then other bits), the hardware won't
> look at the partial entry after seeing present=0.
>
> But now as Dmytro reported there is no barrier in place so two 64bits
> updates to the context entry might be reordered so hw could fetch
> an entry with old lower half (present=1) and new higher half.
>
> then 128bit atomic operation avoids this ordering concern.
You're right. I will update the commit message to be more precise. Since
the hardware fetches the 128-bit context entry atomically, the issue is
essentially a software ordering problem.
We considered three approaches to solve this:
- Memory barriers (to enforce Present bit clearing order)
- WRITE_ONCE() (to prevent compiler reordering)
- 128-bit atomic updates
This patch uses the atomic update approach.
>
>> @@ -1170,19 +1170,19 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
>> goto out_unlock;
>>
>> copied_context_tear_down(iommu, context, bus, devfn);
>> - context_clear_entry(context);
>> - context_set_domain_id(context, did);
>> + context_set_domain_id(&new, did);
>
> I wonder whether it's necessary to use atomic in the attach path, from
> fix p.o.v.
>
> The assumption is that the context should have been cleared already
> before calling this function (and following ones). Does it make more
> sense to check the present bit, warning if set, then fail the operation?
> We could refactor them to do atomic update, but then it's for cleanup
> instead of being part of a fix.
Yes. For the attach path, this is a cleanup rather than a fix.
>
> Then this may be split into three patches:
>
> - change context_clear_entry() to be atomic, to fix the teardown path
> - add present bit check in other functions in this patch, to scrutinize the
> attach path
> - change those functions to be atomic, as a clean up
Perhaps this also paves the way for enabling hitless replace in the
attach_dev path?
> Does it make sense?
Yes, it does.
Thanks,
baolu
^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-14 13:17 ` Jason Gunthorpe
2026-01-14 18:51 ` Samiullah Khawaja
@ 2026-01-15 5:44 ` Tian, Kevin
2026-01-15 13:28 ` Jason Gunthorpe
1 sibling, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2026-01-15 5:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Samiullah Khawaja, Joerg Roedel, Will Deacon,
Robin Murphy, Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 14, 2026 9:17 PM
>
> On Wed, Jan 14, 2026 at 07:26:10AM +0000, Tian, Kevin wrote:
> > before cache is flushed, it may contain:
> >
> > - entries tagged with old DID, with content loaded from old table
> > - entries tagged with old DID, with content loaded from new table
> > - entries tagged with new DID, with content loaded from new table
> >
> > Compared to 2nd-stage the only problematic one is old DID + new table.
> >
> > According to 6.2.1 (Tagging of Cached Translations), the root address
> > of page table is not used in tagging and DID-based invalidation will
> flush all entries related to old DID (no matter whether it's from the old or new table).
> >
> > Then it should just work!
>
> Unless the original domain is attached to another device, then you've
> corrupted the DID and corrupted IOTLB for the second innocent device
> that isn't changing translation.
ah, yes, that's the key point!
>
> > p.s. Jason said that atomic size is 128bit on AMD and 64bit on ARM.
> > they both have DID concept and two page table pointers. So I assume
> > it's the same case on this front?
>
> Hmm, yeah, ARM has it worse: you can't change any ASID/VMID
> concurrently with the table pointer.
>
> You could make a safe algorithm by allocating a temporary ID, moving
> the current entry to the temporary ID, moving to the new pointer,
> moving to the final ID, then flushing the temporary ID.
right, that ensures the corrupted state is only associated with the
temporary ID.
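to spell it out, the temporary-ID sequence can be modeled as a sequence of tagged states, with the invariant that a "corrupt" combination (a long-lived ID paired with the wrong table) never appears. A toy model (all names and values are made up for the example):

```c
#include <assert.h>

/* Toy model: an entry pairs a cache tag (ASID/DID) with a table
 * pointer. Corruption means a long-lived ID tagging entries loaded
 * from the wrong table; anything cached under the disposable
 * temporary ID is flushed at the end, so it doesn't count. */
enum { OLD_ID = 1, TMP_ID = 2, NEW_ID = 3, OLD_TBL = 10, NEW_TBL = 20 };

struct entry { int id; int table; };

/* Nonzero if this intermediate state could corrupt caches tagged with
 * a long-lived (non-temporary) ID. */
static int corrupts_live_id(struct entry e)
{
	if (e.id == TMP_ID)
		return 0;	/* temporary ID is flushed later */
	return (e.id == OLD_ID && e.table == NEW_TBL) ||
	       (e.id == NEW_ID && e.table == OLD_TBL);
}

static void step(struct entry *e, int id, int table)
{
	e->id = id;
	e->table = table;
	assert(!corrupts_live_id(*e));	/* invariant after each update */
}

static void hitless_switch(struct entry *e)
{
	step(e, TMP_ID, OLD_TBL);	/* 1. move to the temporary ID */
	step(e, TMP_ID, NEW_TBL);	/* 2. switch the table pointer */
	step(e, NEW_ID, NEW_TBL);	/* 3. move to the final ID */
	/* 4. flush_id(TMP_ID) -- discard anything cached under it */
}
```

A direct OLD_ID -> NEW_ID switch with a separate table-pointer update would trip the invariant at one of the intermediate states, which is exactly the cross-device corruption discussed above.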
>
> It avoids the cross device issue and your logic above would hold.
>
> Or maybe the case Samiullah is interested in should have the new
> domain adopt the original ID..
>
that also works. the KHO resume process could have a special DID
allocation scheme to reuse the original one.
or in Samiullah's case the old/new domains always contain the
same mappings, so no corruption would ever occur, but that'd be
a very KHO-specific assumption. 😊
* RE: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-14 8:27 ` Baolu Lu
@ 2026-01-15 5:49 ` Tian, Kevin
0 siblings, 0 replies; 39+ messages in thread
From: Tian, Kevin @ 2026-01-15 5:49 UTC (permalink / raw)
To: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Wednesday, January 14, 2026 4:27 PM
>
> On 1/14/26 15:32, Tian, Kevin wrote:
> >> From: Lu Baolu <baolu.lu@linux.intel.com>
> >> Sent: Tuesday, January 13, 2026 11:01 AM
> >>
> >> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> >> bytes). When tearing down an entry, the current implementation zeros
> the
> >> entire 64-byte structure immediately.
> >>
> >> However, the IOMMU hardware may fetch these 64 bytes using multiple
> >> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> >> occurs simultaneously with the CPU zeroing the entry, the hardware
> could
> >> observe a "torn" entry — where some chunks are zeroed and others still
> >> contain old data — leading to unpredictable behavior or spurious faults.
> >>
> >> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> >> (Section 6.5.3.3) by implementing a proper ownership handshake:
> >>
> >> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
> >> hardware that the entry is no longer valid.
> >> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
> >> Device-TLB flush) to ensure the hardware has released all cached
> >> references to the entry.
> >> 3. Only after the flushes are complete, zero out the remaining fields of
> >> the PASID entry.
> >>
> >> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
> >> ensure that the cleared entry is visible to the IOMMU on systems where
> >> memory coherency (ecap_coherent) is not supported.
> >>
> >
> > better split the clflush part into a separate patch
>
> Ah, the commit message is a bit confusing. Previously,
> intel_pasid_clear_entry() was followed by a clflush. Now that we've
> moved the clearing logic after the cache invalidation, an explicit
> clflush is required. Instead of splitting this into a separate patch,
> how about amending the commit message to make this clear?
>
> Or, keep intel_pasid_clear_entry() untouched like the following:
let's go this way to have two places consistent.
>
> iommu/vt-d: Clear Present bit before tearing down PASID entry
>
> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> bytes). When tearing down an entry, the current implementation zeros the
> entire 64-byte structure immediately.
>
> However, the IOMMU hardware may fetch these 64 bytes using multiple
> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> occurs simultaneously with the CPU zeroing the entry, the hardware could
> observe a "torn" entry — where some chunks are zeroed and others still
> contain old data — leading to unpredictable behavior or spurious faults.
>
> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> (Section 6.5.3.3) by implementing a proper ownership handshake:
>
> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
> hardware that the entry is no longer valid.
> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
> Device-TLB flush) to ensure the hardware has released all cached
> references to the entry.
> 3. Only after the flushes are complete, zero out the remaining fields of
> the PASID entry.
>
> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> drivers/iommu/intel/pasid.h | 12 ++++++++++++
> drivers/iommu/intel/pasid.c | 6 +++++-
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index b4c85242dc79..35de1d77355f 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -237,6 +237,18 @@ static inline void pasid_set_present(struct
> pasid_entry *pe)
> pasid_set_bits(&pe->val[0], 1 << 0, 1);
> }
>
> +/*
> + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
> + * This initiates the transition of the entry's ownership from hardware
> + * to software. The caller is responsible for fulfilling the invalidation
> + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
> + * Software for Invalidations).
> + */
> +static inline void pasid_clear_present(struct pasid_entry *pe)
> +{
> + pasid_set_bits(&pe->val[0], 1 << 0, 0);
> +}
> +
> /*
> * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
> * entry.
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 298a39183996..3a8ec5eba83e 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -272,7 +272,7 @@ void intel_pasid_tear_down_entry(struct
> intel_iommu
> *iommu, struct device *dev,
>
> did = pasid_get_domain_id(pte);
> pgtt = pasid_pte_get_pgtt(pte);
> - intel_pasid_clear_entry(dev, pasid, fault_ignore);
> + pasid_clear_present(pte);
> spin_unlock(&iommu->lock);
>
> if (!ecap_coherent(iommu->ecap))
> @@ -286,6 +286,10 @@ void intel_pasid_tear_down_entry(struct
> intel_iommu
> *iommu, struct device *dev,
> iommu->flush.flush_iotlb(iommu, did, 0, 0,
> DMA_TLB_DSI_FLUSH);
>
> devtlb_invalidation_with_pasid(iommu, dev, pasid);
> + intel_pasid_clear_entry(dev, pasid, fault_ignore);
> + if (!ecap_coherent(iommu->ecap))
> + clflush_cache_range(pte, sizeof(*pte));
> +
> if (!fault_ignore)
> intel_iommu_drain_pasid_prq(dev, pasid);
> }
> --
> 2.43.0
>
> Thanks,
> baolu
* RE: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-15 3:26 ` Baolu Lu
@ 2026-01-15 5:59 ` Tian, Kevin
2026-01-15 13:23 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2026-01-15 5:59 UTC (permalink / raw)
To: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe
Cc: Dmytro Maluka, Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Thursday, January 15, 2026 11:27 AM
>
> On 1/14/26 15:54, Tian, Kevin wrote:
> >
> > Then this may be split into three patches:
> >
> > - change context_clear_entry() to be atomic, to fix the teardown path
> > - add present bit check in other functions in this patch, to scrutinize the
> > attach path
> > - change those functions to be atomic, as a clean up
>
> Perhaps this also paves the way for enabling hitless replace in the
> attach_dev path?
>
I didn't get it. attach/replace are different paths and iommu core will
reject an attach request for a device which is already attached...
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-15 2:26 ` Baolu Lu
@ 2026-01-15 13:12 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-15 13:12 UTC (permalink / raw)
To: Baolu Lu
Cc: Dmytro Maluka, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Thu, Jan 15, 2026 at 10:26:16AM +0800, Baolu Lu wrote:
> On 1/14/26 18:55, Dmytro Maluka wrote:
> > On Wed, Jan 14, 2026 at 01:14:36PM +0800, Baolu Lu wrote:
> > > On 1/14/26 03:27, Dmytro Maluka wrote:
> > > > On Tue, Jan 13, 2026 at 11:00:46AM +0800, Lu Baolu wrote:
> > > > > +static __always_inline void intel_iommu_atomic128_set(u128 *ptr, u128 val)
> > > > > +{
> > > > > + /*
> > > > > + * Use the cmpxchg16b instruction for 128-bit atomicity. As updates
> > > > > + * are serialized by a spinlock, we use the local (unlocked) variant
> > > > > + * to avoid unnecessary bus locking overhead.
> > > > > + */
> > > > > + arch_cmpxchg128_local(ptr, *ptr, val);
> > > > Any reason why not cmpxchg128_local()? (except following the AMD driver)
> > >
> > > Yes. This follows the AMD IOMMU driver. Both drivers use spin lock to
> > > synchronize the update of table entries. They only need the atomicity of
> > > the 128-bit instruction itself. So arch_cmpxchg128_local() works.
> >
> > Yeah, but my question was merely: why use the raw arch_*() version, not
> > cmpxchg128_local() which is the same but also includes optional
> > kasan/kcsan instrumentation:
> >
> > #define cmpxchg128_local(ptr, ...) \
> > ({ \
> > typeof(ptr) __ai_ptr = (ptr); \
> > instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
> > raw_cmpxchg128_local(__ai_ptr, __VA_ARGS__); \
> > })
> >
> > IOW, why bypass this instrumentation?
>
> You are right. There is no strong technical reason to bypass the kasan/
> kcsan instrumentation here. My use of the arch_ version was primarily
> following the existing pattern in the AMD driver, likely under the
> assumption that the spinlock provided sufficient synchronization.
>
> That said, Jason has suggested the generic entry_sync library to handle
> these types of multi-quanta updates across different IOMMU drivers. I
> plan to adopt that in the next version.
I also copied the amd driver in my draft so it should be changed to
this, I think?
Jason
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-15 5:59 ` Tian, Kevin
@ 2026-01-15 13:23 ` Jason Gunthorpe
2026-01-16 5:19 ` Tian, Kevin
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-15 13:23 UTC (permalink / raw)
To: Tian, Kevin
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Dmytro Maluka,
Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On Thu, Jan 15, 2026 at 05:59:56AM +0000, Tian, Kevin wrote:
> > From: Baolu Lu <baolu.lu@linux.intel.com>
> > Sent: Thursday, January 15, 2026 11:27 AM
> >
> > On 1/14/26 15:54, Tian, Kevin wrote:
> > >
> > > Then this may be split into three patches:
> > >
> > > - change context_clear_entry() to be atomic, to fix the teardown path
> > > - add present bit check in other functions in this patch, to scrutinize the
> > > attach path
> > > - change those functions to be atomic, as a clean up
> >
> > Perhaps this also paves the way for enabling hitless replace in the
> > attach_dev path?
> >
>
> I didn't get it. attach/replace are different paths and iommu core will
> reject an attach request for a device which is already attached...
From a driver perspective there should be no such thing as
replace. vt-d has gone wrong by having special replace flows inside
itself.
Drivers do attach and should try to make it hitless, that's it.
Meaning drivers do the whole attach sequence in a robust way:
- Manage invalidation lists so that old/new domains continue to
generate invalidations while transitioning
- Updating HW data structures without making them non-present
- Managing ATS enable/disable in the right order
It is a lot to think about, but if you follow the ARM sequence it is
all written out there..
So this series should be about the 2nd one, making HW updates go from
something on the stack to something in the HW, and if Baolu is going
to use entry_set then use entry_set for *everything* and it will deal
with things like using the 128 bit store or making things non-present
temporarily. Even if the thing has only 128 bits used you may as well
stick with it.
Remember also that the 128 bit store is a CPU optional feature. AMD
investigated and found all their CPUs with IOMMUs have the store
feature so they could use it unconditionally (IIRC they check at
driver probe and fail probe without 128 bit store). VTD needs to do
the same thing, and if 128 bit store is actually sometimes not
available it needs to fall back to the 64 bit entry set, for
everything.
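For what the fallback could look like: when only 64-bit stores are available, the entry can still be updated safely (though not hitlessly) by never exposing a present-but-torn entry. A rough sketch, with an assumed two-quadword entry where bit 0 of the first quadword is Present (the real driver would use WRITE_ONCE() and flush caches/TLBs between the steps):

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_QWORDS 2

/* 64-bit fallback for multi-quadword entry updates when a 16-byte
 * atomic store is unavailable: make the entry non-present first, fill
 * in the upper quadwords, then store the quadword carrying the
 * Present bit last. The hardware may briefly see a non-present entry
 * (so this is not hitless), but never a present, torn one. The
 * layout (bit 0 of dst[0] = Present) is illustrative. */
static void entry_set_fallback_64(volatile uint64_t *dst,
				  const uint64_t *src)
{
	int i;

	dst[0] &= ~1ULL;			/* 1. clear Present */
	for (i = 1; i < ENTRY_QWORDS; i++)	/* 2. upper quadwords */
		dst[i] = src[i];
	dst[0] = src[0];			/* 3. publish Present last */
}
```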
Jason
* Re: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-15 5:44 ` Tian, Kevin
@ 2026-01-15 13:28 ` Jason Gunthorpe
2026-01-16 6:16 ` Tian, Kevin
0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-15 13:28 UTC (permalink / raw)
To: Tian, Kevin
Cc: Baolu Lu, Samiullah Khawaja, Joerg Roedel, Will Deacon,
Robin Murphy, Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On Thu, Jan 15, 2026 at 05:44:08AM +0000, Tian, Kevin wrote:
> or in Samiullah's case the old/new domains always contain the
> same mappings, so no corruption would ever occur, but that'd be
> a very KHO-specific assumption. 😊
The thing with KHO is we don't actually know this is true. We hope it
is true, but ultimately userspace is responsible to do it, and the
kernel doesn't check it.
This is why the hitless update is an appealing solution, because it
means we don't need to deal with trying to adopt an entire page table
prepared by another kernel and ensuring this kernel has IOAS areas
that fully cover and exactly match what is in the table.
Unfortunately, doing the extra DID will require a complex invalidation
sequence to hook multiple DIDs into the same domain
simultaneously.. Nicolin is working on this for ARM and I think
Intel's linked list scheme could be tasked to do it too..
Jason
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-15 2:45 ` Baolu Lu
@ 2026-01-15 21:35 ` Dmytro Maluka
2026-01-16 6:06 ` Baolu Lu
0 siblings, 1 reply; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-15 21:35 UTC (permalink / raw)
To: Baolu Lu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Thu, Jan 15, 2026 at 10:45:12AM +0800, Baolu Lu wrote:
> On 1/14/26 19:12, Dmytro Maluka wrote:
> > On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
> > > On 1/14/26 03:34, Dmytro Maluka wrote:
> > > > On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> > > > > + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> > > > Is it safe to do this with iommu->lock already unlocked?
> > >
> > > Yes, it is. The PASID entry lifecycle is serialized by the
> > > iommu_group->mutex in the iommu core, which ensures that no other thread can attempt
> > > to allocate or setup this same PASID until intel_pasid_tear_down_entry()
> > > has returned.
> > >
> > > The iommu->lock is held during the initial transition (P->0) to ensure
> > > atomicity against other hardware-table walkers, but once the P bit is
> > > cleared and the caches are flushed, the final zeroing of the 'dead'
> > > entry does not strictly require the spinlock because the PASID remains
> > > reserved in software until the function completes.
> >
> > Ok. Just to understand: "other hardware-table walkers" means some
> > software walkers, not hardware ones? Which software walkers are those?
> > (I can't imagine how holding a spinlock could prevent the hardware from
> > walking those tables. :))
>
> You are right. A spinlock doesn't stop the hardware. The spinlock
> serializes software threads to ensure the hardware walker always sees a
> consistent entry.
>
> When a PASID entry is active (P=1), other kernel paths might modify
> the control bits in-place. For example:
>
> void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
> struct device *dev, u32 pasid)
> {
> struct pasid_entry *pte;
> u16 did;
>
> spin_lock(&iommu->lock);
> pte = intel_pasid_get_entry(dev, pasid);
> if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
> spin_unlock(&iommu->lock);
> return;
> }
>
> pasid_set_pgsnp(pte);
> did = pasid_get_domain_id(pte);
> spin_unlock(&iommu->lock);
>
> intel_pasid_flush_present(iommu, dev, pasid, did, pte);
> }
>
> In this case, the iommu->lock ensures that if two threads try to modify
> the same active entry, they don't interfere with each other and leave
> the entry in a 'torn' state for the IOMMU hardware to read.
>
> In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
> (setting P=0 and flushing caches), the entry is owned exclusively by
> the teardown thread until it is re-configured. That's the reason why the
> final zeroing doesn't need the spinlock.
I see. Am I correct that those other code paths (modifying an entry
in-place) are not supposed to do that concurrently with
intel_pasid_tear_down_entry(), i.e. they should only do that while it is
guaranteed that the entry remains present? Otherwise there is a bug
(hence, for example, the WARN_ON in
intel_pasid_setup_page_snoop_control())? So, holding iommu->lock during
entry teardown is not strictly necessary (i.e. we could unlock it even
earlier than setting P=0), i.e. holding the lock until the entry is
deactivated is basically just a safety measure for possible buggy code?
* RE: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-15 13:23 ` Jason Gunthorpe
@ 2026-01-16 5:19 ` Tian, Kevin
2026-01-16 14:33 ` Jason Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Tian, Kevin @ 2026-01-16 5:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Dmytro Maluka,
Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 15, 2026 9:23 PM
>
> On Thu, Jan 15, 2026 at 05:59:56AM +0000, Tian, Kevin wrote:
> > > From: Baolu Lu <baolu.lu@linux.intel.com>
> > > Sent: Thursday, January 15, 2026 11:27 AM
> > >
> > > On 1/14/26 15:54, Tian, Kevin wrote:
> > > >
> > > > Then this may be split into three patches:
> > > >
> > > > - change context_clear_entry() to be atomic, to fix the teardown path
> > > > - add present bit check in other functions in this patch, to scrutinize the
> > > > attach path
> > > > - change those functions to be atomic, as a clean up
> > >
> > > Perhaps this also paves the way for enabling hitless replace in the
> > > attach_dev path?
> > >
> >
> > I didn't get it. attach/replace are different paths and iommu core will
> > reject an attach request for a device which is already attached...
>
> From a driver perspective there should be no such thing as
> repalce. vt-d has gone wrong by having special replace flows inside
> itself.
>
> Drivers do attach and should try to make it hitless, that's it.
Agree
>
> Meaning drivers do the whole attach sequence in a robust way:
> - Manage invalidation lists so that old/new domains continue to
> generate invalidations while transitioning
> - Updating HW data structures without making them non-present
> - Managing ATS enable/disable in the right order
>
> It is alot to think about, but if you follow the ARM sequence it is
> all written out there..
>
> So this series should be about the 2nd one, making HW updates go from
> something on the stack to something in the HW, and if Balou is going
> to use entry_set then use entry_set for *everything* and it will deal
> with things like using the 128 bit store or making things non-present
> temporarily. Even if the thing has only 128 bits used you may as well
> stick with it.
>
> Remember also that the 128 bit store is a CPU optional feature. AMD
> investigated and found all their CPUs with IOMMUs have the store
> feature so they could use it unconditionally (IIRC they check at
> driver probe and fail probe without 128 bit store). VTD needs to do
> the same thing, and if 128 bit store is actually sometimes not
> available it needs to fall back to the 64 bit entry set, for
> eveything.
>
We are checking that...
btw the way that the VT-d spec describes it as "software MUST use
128bit..." highly suggests that it's either as AMD or with an exception
only on very early generations. Another exception is VM which
may have hardware supporting 128bit but vcpu model doesn't,
but I doubt it's a meaningful configuration nowadays...
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-15 21:35 ` Dmytro Maluka
@ 2026-01-16 6:06 ` Baolu Lu
2026-01-20 13:49 ` Dmytro Maluka
0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2026-01-16 6:06 UTC (permalink / raw)
To: Dmytro Maluka
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On 1/16/26 05:35, Dmytro Maluka wrote:
> On Thu, Jan 15, 2026 at 10:45:12AM +0800, Baolu Lu wrote:
>> On 1/14/26 19:12, Dmytro Maluka wrote:
>>> On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
>>>> On 1/14/26 03:34, Dmytro Maluka wrote:
>>>>> On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
>>>>>> + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
>>>>> Is it safe to do this with iommu->lock already unlocked?
>>>>
>>>> Yes, it is. The PASID entry lifecycle is serialized by the
>>>> iommu_group->mutex in the iommu core, which ensures that no other thread can attempt
>>>> to allocate or setup this same PASID until intel_pasid_tear_down_entry()
>>>> has returned.
>>>>
>>>> The iommu->lock is held during the initial transition (P->0) to ensure
>>>> atomicity against other hardware-table walkers, but once the P bit is
>>>> cleared and the caches are flushed, the final zeroing of the 'dead'
>>>> entry does not strictly require the spinlock because the PASID remains
>>>> reserved in software until the function completes.
>>>
>>> Ok. Just to understand: "other hardware-table walkers" means some
>>> software walkers, not hardware ones? Which software walkers are those?
>>> (I can't imagine how holding a spinlock could prevent the hardware from
>>> walking those tables. :))
>>
>> You are right. A spinlock doesn't stop the hardware. The spinlock
>> serializes software threads to ensure the hardware walker always sees a
>> consistent entry.
>>
>> When a PASID entry is active (P=1), other kernel paths might modify
>> the control bits in-place. For example:
>>
>> void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
>> struct device *dev, u32 pasid)
>> {
>> struct pasid_entry *pte;
>> u16 did;
>>
>> spin_lock(&iommu->lock);
>> pte = intel_pasid_get_entry(dev, pasid);
>> if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
>> spin_unlock(&iommu->lock);
>> return;
>> }
>>
>> pasid_set_pgsnp(pte);
>> did = pasid_get_domain_id(pte);
>> spin_unlock(&iommu->lock);
>>
>> intel_pasid_flush_present(iommu, dev, pasid, did, pte);
>> }
>>
>> In this case, the iommu->lock ensures that if two threads try to modify
>> the same active entry, they don't interfere with each other and leave
>> the entry in a 'torn' state for the IOMMU hardware to read.
>>
>> In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
>> (setting P=0 and flushing caches), the entry is owned exclusively by
>> the teardown thread until it is re-configured. That's the reason why the
>> final zeroing doesn't need the spinlock.
>
> I see. Am I correct that those other code paths (modifying an entry
> in-place) are not supposed to do that concurrently with
> intel_pasid_tear_down_entry(), i.e. they should only do that while it is
> guaranteed that the entry remains present? Otherwise there is a bug
> (hence, for example, the WARN_ON in
> intel_pasid_setup_page_snoop_control())?
The iommu driver assumes that high-level software should ensure this.
> So, holding iommu->lock during
> entry teardown is not strictly necessary (i.e. we could unlock it even
> earlier than setting P=0), i.e. holding the lock until the entry is
> deactivated is basically just a safety measure for possible buggy code?
There are other paths that may be concurrent, such as the debugfs path
(dumping the pasid table through debugfs). Therefore, keeping
iommu->lock in the driver is neither redundant nor buggy.
Thanks,
baolu
* RE: [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement
2026-01-15 13:28 ` Jason Gunthorpe
@ 2026-01-16 6:16 ` Tian, Kevin
0 siblings, 0 replies; 39+ messages in thread
From: Tian, Kevin @ 2026-01-16 6:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Samiullah Khawaja, Joerg Roedel, Will Deacon,
Robin Murphy, Dmytro Maluka, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 15, 2026 9:28 PM
>
> On Thu, Jan 15, 2026 at 05:44:08AM +0000, Tian, Kevin wrote:
>
> > or in Samiullah's case the old/new domains always contains the
> > same mappings, so no corruption would ever occur, but that'd be
> > a very KHO specific assumption. 😊
>
> The thing witih KHO is we don't actually know this is true. We hope it
> is true, but ultimately userspace is responsible to do it, and the
> kernel doesn't check it.
Makes sense.
Then the option of having the new domain adopt the original ID could not
work either. We cannot have one DID tagging two domains which may
contain different mappings.
>
> This is why the hitless update in an appealing solution because it
> means we don't need to deal with trying to adopt an entire page table
> prepared by another kernel and ensuring this kernel has IOAS areas
> that fully cover and exactly match what is in the table.
>
> Unfortunately to do the extra DID will require a complex invalidation
> sequence to hook multiple DIDs into the same domain
> simultaneously.. Nicolin is working on this for ARM and I think
> Intel's linked list scheme could be tasked to do it too..
>
yes
* Re: [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries
2026-01-16 5:19 ` Tian, Kevin
@ 2026-01-16 14:33 ` Jason Gunthorpe
0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2026-01-16 14:33 UTC (permalink / raw)
To: Tian, Kevin
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Dmytro Maluka,
Samiullah Khawaja, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
On Fri, Jan 16, 2026 at 05:19:52AM +0000, Tian, Kevin wrote:
> btw the way that the VT-d spec describes it as "software MUST use
> 128bit..." highly suggests that it's either as AMD or with an exception
> only on very early generations. Another exception is VM which
> may have hardware supporting 128bit but vcpu model doesn't,
> but I doubt it's a meaningful configuration nowadays...
The VMs is an interesting point..
I think with a bit of fiddling the entry_set mechanism could fall back
to 64 bit operation automatically, and that would handle the
compatability needs. It would be much less hitless in such systems but
who cares?
Jason
* Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
2026-01-16 6:06 ` Baolu Lu
@ 2026-01-20 13:49 ` Dmytro Maluka
0 siblings, 0 replies; 39+ messages in thread
From: Dmytro Maluka @ 2026-01-20 13:49 UTC (permalink / raw)
To: Baolu Lu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Samiullah Khawaja, iommu, linux-kernel,
Vineeth Pillai (Google), Aashish Sharma
On Fri, Jan 16, 2026 at 02:06:30PM +0800, Baolu Lu wrote:
> On 1/16/26 05:35, Dmytro Maluka wrote:
> > On Thu, Jan 15, 2026 at 10:45:12AM +0800, Baolu Lu wrote:
> > > On 1/14/26 19:12, Dmytro Maluka wrote:
> > > > On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
> > > > > On 1/14/26 03:34, Dmytro Maluka wrote:
> > > > > > On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> > > > > > > + intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> > > > > > Is it safe to do this with iommu->lock already unlocked?
> > > > >
> > > > > Yes, it is. The PASID entry lifecycle is serialized by the
> > > > > iommu_group->mutex in the iommu core, which ensures that no other thread can attempt
> > > > > to allocate or setup this same PASID until intel_pasid_tear_down_entry()
> > > > > has returned.
> > > > >
> > > > > The iommu->lock is held during the initial transition (P->0) to ensure
> > > > > atomicity against other hardware-table walkers, but once the P bit is
> > > > > cleared and the caches are flushed, the final zeroing of the 'dead'
> > > > > entry does not strictly require the spinlock because the PASID remains
> > > > > reserved in software until the function completes.
> > > >
> > > > Ok. Just to understand: "other hardware-table walkers" means some
> > > > software walkers, not hardware ones? Which software walkers are those?
> > > > (I can't imagine how holding a spinlock could prevent the hardware from
> > > > walking those tables. :))
> > >
> > > You are right. A spinlock doesn't stop the hardware. The spinlock
> > > serializes software threads to ensure the hardware walker always sees a
> > > consistent entry.
> > >
> > > When a PASID entry is active (P=1), other kernel paths might modify
> > > the control bits in-place. For example:
> > >
> > > void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
> > > struct device *dev, u32 pasid)
> > > {
> > > struct pasid_entry *pte;
> > > u16 did;
> > >
> > > spin_lock(&iommu->lock);
> > > pte = intel_pasid_get_entry(dev, pasid);
> > > if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
> > > spin_unlock(&iommu->lock);
> > > return;
> > > }
> > >
> > > pasid_set_pgsnp(pte);
> > > did = pasid_get_domain_id(pte);
> > > spin_unlock(&iommu->lock);
> > >
> > > intel_pasid_flush_present(iommu, dev, pasid, did, pte);
> > > }
> > >
> > > In this case, the iommu->lock ensures that if two threads try to modify
> > > the same active entry, they don't interfere with each other and leave
> > > the entry in a 'torn' state for the IOMMU hardware to read.
> > >
> > > In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
> > > (setting P=0 and flushing caches), the entry is owned exclusively by
> > > the teardown thread until it is re-configured. That's the reason why the
> > > final zeroing doesn't need the spinlock.
> >
> > I see. Am I correct that those other code paths (modifying an entry
> > in-place) are not supposed to do that concurrently with
> > intel_pasid_tear_down_entry(), i.e. they should only do that while it is
> > guaranteed that the entry remains present? Otherwise there is a bug
> > (hence, for example, the WARN_ON in
> > intel_pasid_setup_page_snoop_control())?
>
> The iommu driver relies on higher-level software to ensure this.
>
> > So, holding iommu->lock during
> > entry teardown is not strictly necessary (i.e. we could unlock it even
> > earlier than setting P=0), i.e. holding the lock until the entry is
> > deactivated is basically just a safety measure for possible buggy code?
>
> There are other paths that may be concurrent, such as the debugfs path
> (dumping the PASID table through debugfs). Therefore, keeping
> iommu->lock in the driver is neither redundant nor buggy.
I see, that makes sense: clearing the Present bit before iommu->lock is
released should prevent read-only walkers such as the debugfs dump from
walking further down the path (i.e. into the page tables) once the lock
is dropped.
> Thanks,
> baolu
Thread overview: 39+ messages
2026-01-13 3:00 [PATCH 0/3] iommu/vt-d: Ensure atomicity in context and PASID entry updates Lu Baolu
2026-01-13 3:00 ` [PATCH 1/3] iommu/vt-d: Use 128-bit atomic updates for context entries Lu Baolu
2026-01-13 19:27 ` Dmytro Maluka
2026-01-14 5:14 ` Baolu Lu
2026-01-14 10:55 ` Dmytro Maluka
2026-01-15 2:26 ` Baolu Lu
2026-01-15 13:12 ` Jason Gunthorpe
2026-01-14 7:54 ` Tian, Kevin
2026-01-15 3:26 ` Baolu Lu
2026-01-15 5:59 ` Tian, Kevin
2026-01-15 13:23 ` Jason Gunthorpe
2026-01-16 5:19 ` Tian, Kevin
2026-01-16 14:33 ` Jason Gunthorpe
2026-01-13 3:00 ` [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry Lu Baolu
2026-01-13 19:34 ` Dmytro Maluka
2026-01-14 5:38 ` Baolu Lu
2026-01-14 11:12 ` Dmytro Maluka
2026-01-15 2:45 ` Baolu Lu
2026-01-15 21:35 ` Dmytro Maluka
2026-01-16 6:06 ` Baolu Lu
2026-01-20 13:49 ` Dmytro Maluka
2026-01-14 7:32 ` Tian, Kevin
2026-01-14 8:27 ` Baolu Lu
2026-01-15 5:49 ` Tian, Kevin
2026-01-13 3:00 ` [PATCH 3/3] iommu/vt-d: Rework hitless PASID entry replacement Lu Baolu
2026-01-13 15:05 ` Jason Gunthorpe
2026-01-14 6:03 ` Baolu Lu
2026-01-13 19:27 ` Samiullah Khawaja
2026-01-13 20:56 ` Jason Gunthorpe
2026-01-14 5:45 ` Baolu Lu
2026-01-14 7:26 ` Tian, Kevin
2026-01-14 13:17 ` Jason Gunthorpe
2026-01-14 18:51 ` Samiullah Khawaja
2026-01-14 19:07 ` Jason Gunthorpe
2026-01-15 5:44 ` Tian, Kevin
2026-01-15 13:28 ` Jason Gunthorpe
2026-01-16 6:16 ` Tian, Kevin
2026-01-13 19:39 ` Dmytro Maluka
2026-01-13 20:06 ` Dmytro Maluka