All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
@ 2025-12-27 17:57 Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 1/5] iommu/vt-d: Sanitize set bits in pasid_set_bits() Dmytro Maluka
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Dmytro Maluka @ 2025-12-27 17:57 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, linux-kernel,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Chuanxiao Dong, Kevin Tian, Dmytro Maluka

As discussed in [1], we don't currently prevent the compiler from
reordering memory writes when updating context entries, which is
potentially dangerous, as it may cause setting the present bit (i.e.
enabling DMA translation for the given device) before finishing setting
up other bits in the context entry (and thus creating a time window when
a DMA from the device may result in an unpredicted behavior).

Fix this in the same way as how this is already addressed for PASID
entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
setting individual bits in context entries, so that memory writes done
by those helpers are ordered in relation to each other (plus, prevent
load/store tearing and so on).

While at it, similarly paranoidally fix updating root entries as well:
use WRITE_ONCE to make sure that the present bit is set atomically
together with the context table address bits, not before them.

[1] https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/

v1 -> v2:
- Sanitize bits to not exceed the mask (suggested by Baolu)
- Reuse pasid_set_bits() for context entries as well (rename it to
  entry_set_bits())
- Add extra barrier in *_set_present() (suggested by Baolu)

Dmytro Maluka (5):
  iommu/vt-d: Sanitize set bits in pasid_set_bits()
  iommu/vt-d: Generalize pasid_set_bits()
  iommu/vt-d: Ensure memory ordering in context entry updates
  iommu/vt-d: Use smp_wmb() before setting context/pasid present bit
  iommu/vt-d: Use WRITE_ONCE for setting root table entries

 drivers/iommu/intel/iommu.c |  2 +-
 drivers/iommu/intel/iommu.h | 49 +++++++++++++++++++++++++------------
 drivers/iommu/intel/pasid.c |  3 ++-
 drivers/iommu/intel/pasid.h | 46 +++++++++++++++++-----------------
 4 files changed, 59 insertions(+), 41 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/5] iommu/vt-d: Sanitize set bits in pasid_set_bits()
  2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
@ 2025-12-27 17:57 ` Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 2/5] iommu/vt-d: Generalize pasid_set_bits() Dmytro Maluka
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Dmytro Maluka @ 2025-12-27 17:57 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, linux-kernel,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Chuanxiao Dong, Kevin Tian, Dmytro Maluka

When setting PASID entry bits via pasid_set_*() helpers, apply the
specified mask to the specified bits to be set, as a safety measure
in case the caller accidentally passes a value that exceeds the
specified mask. Also warn if that happens.

Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
---
 drivers/iommu/intel/pasid.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index b4c85242dc79..39acd3efa3ab 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -113,8 +113,10 @@ static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
 {
 	u64 old;
 
+	WARN_ON_ONCE(bits & ~mask);
+
 	old = READ_ONCE(*ptr);
-	WRITE_ONCE(*ptr, (old & ~mask) | bits);
+	WRITE_ONCE(*ptr, (old & ~mask) | (bits & mask));
 }
 
 static inline u64 pasid_get_bits(u64 *ptr)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 2/5] iommu/vt-d: Generalize pasid_set_bits()
  2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 1/5] iommu/vt-d: Sanitize set bits in pasid_set_bits() Dmytro Maluka
@ 2025-12-27 17:57 ` Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 3/5] iommu/vt-d: Ensure memory ordering in context entry updates Dmytro Maluka
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Dmytro Maluka @ 2025-12-27 17:57 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, linux-kernel,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Chuanxiao Dong, Kevin Tian, Dmytro Maluka

To prepare for reusing the pasid_set_bits() for context entries as well,
rename it to entry_set_bits() and move its definition to iommu.h.

Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
---
 drivers/iommu/intel/iommu.h |  9 ++++++++
 drivers/iommu/intel/pasid.h | 42 +++++++++++++++----------------------
 2 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 25c5e22096d4..2fab7ff4b932 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -897,6 +897,15 @@ static inline int pfn_level_offset(u64 pfn, int level)
 	return (pfn >> level_to_offset_bits(level)) & LEVEL_MASK;
 }
 
+static inline void entry_set_bits(u64 *ptr, u64 mask, u64 bits)
+{
+	u64 old;
+
+	WARN_ON_ONCE(bits & ~mask);
+
+	old = READ_ONCE(*ptr);
+	WRITE_ONCE(*ptr, (old & ~mask) | (bits & mask));
+}
 
 static inline void context_set_present(struct context_entry *context)
 {
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 39acd3efa3ab..f8fc73676192 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -10,6 +10,8 @@
 #ifndef __INTEL_PASID_H
 #define __INTEL_PASID_H
 
+#include "iommu.h"
+
 #define PASID_MAX			0x100000
 #define PASID_PTE_MASK			0x3F
 #define PASID_PTE_PRESENT		1
@@ -109,16 +111,6 @@ static inline void pasid_clear_entry_with_fpd(struct pasid_entry *pe)
 	WRITE_ONCE(pe->val[7], 0);
 }
 
-static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
-{
-	u64 old;
-
-	WARN_ON_ONCE(bits & ~mask);
-
-	old = READ_ONCE(*ptr);
-	WRITE_ONCE(*ptr, (old & ~mask) | (bits & mask));
-}
-
 static inline u64 pasid_get_bits(u64 *ptr)
 {
 	return READ_ONCE(*ptr);
@@ -131,7 +123,7 @@ static inline u64 pasid_get_bits(u64 *ptr)
 static inline void
 pasid_set_domain_id(struct pasid_entry *pe, u64 value)
 {
-	pasid_set_bits(&pe->val[1], GENMASK_ULL(15, 0), value);
+	entry_set_bits(&pe->val[1], GENMASK_ULL(15, 0), value);
 }
 
 /*
@@ -150,7 +142,7 @@ pasid_get_domain_id(struct pasid_entry *pe)
 static inline void
 pasid_set_slptr(struct pasid_entry *pe, u64 value)
 {
-	pasid_set_bits(&pe->val[0], VTD_PAGE_MASK, value);
+	entry_set_bits(&pe->val[0], VTD_PAGE_MASK, value);
 }
 
 /*
@@ -160,7 +152,7 @@ pasid_set_slptr(struct pasid_entry *pe, u64 value)
 static inline void
 pasid_set_address_width(struct pasid_entry *pe, u64 value)
 {
-	pasid_set_bits(&pe->val[0], GENMASK_ULL(4, 2), value << 2);
+	entry_set_bits(&pe->val[0], GENMASK_ULL(4, 2), value << 2);
 }
 
 /*
@@ -170,7 +162,7 @@ pasid_set_address_width(struct pasid_entry *pe, u64 value)
 static inline void
 pasid_set_translation_type(struct pasid_entry *pe, u64 value)
 {
-	pasid_set_bits(&pe->val[0], GENMASK_ULL(8, 6), value << 6);
+	entry_set_bits(&pe->val[0], GENMASK_ULL(8, 6), value << 6);
 }
 
 /*
@@ -179,7 +171,7 @@ pasid_set_translation_type(struct pasid_entry *pe, u64 value)
  */
 static inline void pasid_set_fault_enable(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[0], 1 << 1, 0);
+	entry_set_bits(&pe->val[0], 1 << 1, 0);
 }
 
 /*
@@ -189,7 +181,7 @@ static inline void pasid_set_fault_enable(struct pasid_entry *pe)
  */
 static inline void pasid_set_ssade(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[0], 1 << 9, 1 << 9);
+	entry_set_bits(&pe->val[0], 1 << 9, 1 << 9);
 }
 
 /*
@@ -199,7 +191,7 @@ static inline void pasid_set_ssade(struct pasid_entry *pe)
  */
 static inline void pasid_clear_ssade(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[0], 1 << 9, 0);
+	entry_set_bits(&pe->val[0], 1 << 9, 0);
 }
 
 /*
@@ -218,7 +210,7 @@ static inline bool pasid_get_ssade(struct pasid_entry *pe)
  */
 static inline void pasid_set_sre(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[2], 1 << 0, 1);
+	entry_set_bits(&pe->val[2], 1 << 0, 1);
 }
 
 /*
@@ -227,7 +219,7 @@ static inline void pasid_set_sre(struct pasid_entry *pe)
  */
 static inline void pasid_set_wpe(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[2], 1 << 4, 1 << 4);
+	entry_set_bits(&pe->val[2], 1 << 4, 1 << 4);
 }
 
 /*
@@ -236,7 +228,7 @@ static inline void pasid_set_wpe(struct pasid_entry *pe)
  */
 static inline void pasid_set_present(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[0], 1 << 0, 1);
+	entry_set_bits(&pe->val[0], 1 << 0, 1);
 }
 
 /*
@@ -245,7 +237,7 @@ static inline void pasid_set_present(struct pasid_entry *pe)
  */
 static inline void pasid_set_page_snoop(struct pasid_entry *pe, bool value)
 {
-	pasid_set_bits(&pe->val[1], 1 << 23, value << 23);
+	entry_set_bits(&pe->val[1], 1 << 23, value << 23);
 }
 
 /*
@@ -255,7 +247,7 @@ static inline void pasid_set_page_snoop(struct pasid_entry *pe, bool value)
 static inline void
 pasid_set_pgsnp(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[1], 1ULL << 24, 1ULL << 24);
+	entry_set_bits(&pe->val[1], 1ULL << 24, 1ULL << 24);
 }
 
 /*
@@ -265,7 +257,7 @@ pasid_set_pgsnp(struct pasid_entry *pe)
 static inline void
 pasid_set_flptr(struct pasid_entry *pe, u64 value)
 {
-	pasid_set_bits(&pe->val[2], VTD_PAGE_MASK, value);
+	entry_set_bits(&pe->val[2], VTD_PAGE_MASK, value);
 }
 
 /*
@@ -275,7 +267,7 @@ pasid_set_flptr(struct pasid_entry *pe, u64 value)
 static inline void
 pasid_set_flpm(struct pasid_entry *pe, u64 value)
 {
-	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
+	entry_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
 }
 
 /*
@@ -284,7 +276,7 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
  */
 static inline void pasid_set_eafe(struct pasid_entry *pe)
 {
-	pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
+	entry_set_bits(&pe->val[2], 1 << 7, 1 << 7);
 }
 
 extern unsigned int intel_pasid_max_id;
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 3/5] iommu/vt-d: Ensure memory ordering in context entry updates
  2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 1/5] iommu/vt-d: Sanitize set bits in pasid_set_bits() Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 2/5] iommu/vt-d: Generalize pasid_set_bits() Dmytro Maluka
@ 2025-12-27 17:57 ` Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 4/5] iommu/vt-d: Use smp_wmb() before setting context/pasid present bit Dmytro Maluka
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Dmytro Maluka @ 2025-12-27 17:57 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, linux-kernel,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Chuanxiao Dong, Kevin Tian, Dmytro Maluka

We do take care to not set the present bit in a context table entry via
context_set_present() earlier than setting up all other bits in it.
However, we don't do anything to actually ensure this order, i.e. to
prevent the compiler from reordering it. And since context entries may
be updated at runtime when translation is already enabled, this is a
potential source of bugs or security issues.

To easily fix this, convert the context_set_*() and context_clear_*()
helpers to use entry_set_bits() which uses READ_ONCE/WRITE_ONCE, to
ensure that the ordering between updates of individual bits in context
entries matches the order of calling those helpers, just like we already
do that for PASID table entries.

Link: https://lore.kernel.org/all/aTG7gc7I5wExai3S@google.com/
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
---
 drivers/iommu/intel/iommu.h | 30 ++++++++++++++----------------
 drivers/iommu/intel/pasid.c |  3 ++-
 2 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2fab7ff4b932..5bc69ffc7c8e 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -869,7 +869,7 @@ static inline bool dma_pte_superpage(struct dma_pte *pte)
 
 static inline bool context_present(struct context_entry *context)
 {
-	return (context->lo & 1);
+	return READ_ONCE(context->lo) & 1;
 }
 
 #define LEVEL_STRIDE		(9)
@@ -909,43 +909,41 @@ static inline void entry_set_bits(u64 *ptr, u64 mask, u64 bits)
 
 static inline void context_set_present(struct context_entry *context)
 {
-	context->lo |= 1;
+	entry_set_bits(&context->lo, 1ULL << 0, 1ULL);
 }
 
 static inline void context_set_fault_enable(struct context_entry *context)
 {
-	context->lo &= (((u64)-1) << 2) | 1;
+	entry_set_bits(&context->lo, 1ULL << 1, 0ULL);
 }
 
 static inline void context_set_translation_type(struct context_entry *context,
 						unsigned long value)
 {
-	context->lo &= (((u64)-1) << 4) | 3;
-	context->lo |= (value & 3) << 2;
+	entry_set_bits(&context->lo, GENMASK_ULL(3, 2), value << 2);
 }
 
 static inline void context_set_address_root(struct context_entry *context,
 					    unsigned long value)
 {
-	context->lo &= ~VTD_PAGE_MASK;
-	context->lo |= value & VTD_PAGE_MASK;
+	entry_set_bits(&context->lo, VTD_PAGE_MASK, value);
 }
 
 static inline void context_set_address_width(struct context_entry *context,
 					     unsigned long value)
 {
-	context->hi |= value & 7;
+	entry_set_bits(&context->hi, GENMASK_ULL(2, 0), value);
 }
 
 static inline void context_set_domain_id(struct context_entry *context,
 					 unsigned long value)
 {
-	context->hi |= (value & ((1 << 16) - 1)) << 8;
+	entry_set_bits(&context->hi, GENMASK_ULL(23, 8), value << 8);
 }
 
 static inline void context_set_pasid(struct context_entry *context)
 {
-	context->lo |= CONTEXT_PASIDE;
+	entry_set_bits(&context->lo, CONTEXT_PASIDE, CONTEXT_PASIDE);
 }
 
 static inline int context_domain_id(struct context_entry *c)
@@ -955,8 +953,8 @@ static inline int context_domain_id(struct context_entry *c)
 
 static inline void context_clear_entry(struct context_entry *context)
 {
-	context->lo = 0;
-	context->hi = 0;
+	WRITE_ONCE(context->lo, 0);
+	WRITE_ONCE(context->hi, 0);
 }
 
 #ifdef CONFIG_INTEL_IOMMU
@@ -989,7 +987,7 @@ clear_context_copied(struct intel_iommu *iommu, u8 bus, u8 devfn)
 static inline void
 context_set_sm_rid2pasid(struct context_entry *context, unsigned long pasid)
 {
-	context->hi |= pasid & ((1 << 20) - 1);
+	entry_set_bits(&context->hi, GENMASK_ULL(19, 0), pasid);
 }
 
 /*
@@ -998,7 +996,7 @@ context_set_sm_rid2pasid(struct context_entry *context, unsigned long pasid)
  */
 static inline void context_set_sm_dte(struct context_entry *context)
 {
-	context->lo |= BIT_ULL(2);
+	entry_set_bits(&context->lo, BIT_ULL(2), BIT_ULL(2));
 }
 
 /*
@@ -1007,7 +1005,7 @@ static inline void context_set_sm_dte(struct context_entry *context)
  */
 static inline void context_set_sm_pre(struct context_entry *context)
 {
-	context->lo |= BIT_ULL(4);
+	entry_set_bits(&context->lo, BIT_ULL(4), BIT_ULL(4));
 }
 
 /*
@@ -1016,7 +1014,7 @@ static inline void context_set_sm_pre(struct context_entry *context)
  */
 static inline void context_clear_sm_pre(struct context_entry *context)
 {
-	context->lo &= ~BIT_ULL(4);
+	entry_set_bits(&context->lo, BIT_ULL(4), 0);
 }
 
 /* Returns a number of VTD pages, but aligned to MM page size */
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..10bc1908d892 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -983,7 +983,8 @@ static int context_entry_set_pasid_table(struct context_entry *context,
 	context_clear_entry(context);
 
 	pds = context_get_sm_pds(table);
-	context->lo = (u64)virt_to_phys(table->table) | context_pdts(pds);
+	WRITE_ONCE(context->lo,
+		   (u64)virt_to_phys(table->table) | context_pdts(pds));
 	context_set_sm_rid2pasid(context, IOMMU_NO_PASID);
 
 	if (info->ats_supported)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 4/5] iommu/vt-d: Use smp_wmb() before setting context/pasid present bit
  2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
                   ` (2 preceding siblings ...)
  2025-12-27 17:57 ` [PATCH v2 3/5] iommu/vt-d: Ensure memory ordering in context entry updates Dmytro Maluka
@ 2025-12-27 17:57 ` Dmytro Maluka
  2025-12-27 17:57 ` [PATCH v2 5/5] iommu/vt-d: Use WRITE_ONCE for setting root table entries Dmytro Maluka
  2026-01-05 18:12 ` [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Jason Gunthorpe
  5 siblings, 0 replies; 23+ messages in thread
From: Dmytro Maluka @ 2025-12-27 17:57 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, linux-kernel,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Chuanxiao Dong, Kevin Tian, Dmytro Maluka

With the previous patch we already ensure that the present bit in
context or PASID entries is not set earlier than setting/clearing other
needed bits (assuming that setting/clearing any bits in them is always
done via WRITE_ONCE, which is enough to ensure ordering between them on
x86).

However, it also doesn't hurt to add an explicit smp_wmb() barrier
(which on x86 is merely a compiler barrier) before setting the present
bit, as an extra safety measure in case we still forget to use
WRITE_ONCE when updating any other bits in context/PASID entries in the
future, plus for documentation purposes.

Suggested-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
---
 drivers/iommu/intel/iommu.h | 10 ++++++++++
 drivers/iommu/intel/pasid.h |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 5bc69ffc7c8e..75576885314b 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -909,6 +909,16 @@ static inline void entry_set_bits(u64 *ptr, u64 mask, u64 bits)
 
 static inline void context_set_present(struct context_entry *context)
 {
+	/*
+	 * Make sure to not set the present bit earlier than updating other
+	 * bits.
+	 *
+	 * This barrier may be redundant, but only as long as any context
+	 * entry modifications use WRITE_ONCE(), which is enough to ensure
+	 * ordering between them on x86 hardware.
+	 */
+	smp_wmb();
+
 	entry_set_bits(&context->lo, 1ULL << 0, 1ULL);
 }
 
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index f8fc73676192..0d50959a0495 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -228,6 +228,12 @@ static inline void pasid_set_wpe(struct pasid_entry *pe)
  */
 static inline void pasid_set_present(struct pasid_entry *pe)
 {
+	/*
+	 * Make sure to not set the present bit earlier than updating other
+	 * bits. See also the comment in context_set_present().
+	 */
+	smp_wmb();
+
 	entry_set_bits(&pe->val[0], 1 << 0, 1);
 }
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 5/5] iommu/vt-d: Use WRITE_ONCE for setting root table entries
  2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
                   ` (3 preceding siblings ...)
  2025-12-27 17:57 ` [PATCH v2 4/5] iommu/vt-d: Use smp_wmb() before setting context/pasid present bit Dmytro Maluka
@ 2025-12-27 17:57 ` Dmytro Maluka
  2026-01-05 18:12 ` [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Jason Gunthorpe
  5 siblings, 0 replies; 23+ messages in thread
From: Dmytro Maluka @ 2025-12-27 17:57 UTC (permalink / raw)
  To: David Woodhouse, Lu Baolu, iommu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, linux-kernel,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Chuanxiao Dong, Kevin Tian, Dmytro Maluka

Like context table entries (addressed in the previous patches), root
table entries may also be set up at runtime, when DMA translation is
already enabled (e.g. due to a device hotplug). So to stay on the safe
side, use WRITE_ONCE when setting a root table entry, to prevent the
compiler from doing store tearing which could result in setting the
present bit earlier than the context table address bits.

Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
---
 drivers/iommu/intel/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 134302fbcd92..fe4d0d210a5f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -374,7 +374,7 @@ struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
 
 		__iommu_flush_cache(iommu, (void *)context, CONTEXT_SIZE);
 		phy_addr = virt_to_phys((void *)context);
-		*entry = phy_addr | 1;
+		WRITE_ONCE(*entry, phy_addr | 1);
 		__iommu_flush_cache(iommu, entry, sizeof(*entry));
 	}
 	return &context[devfn];
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
                   ` (4 preceding siblings ...)
  2025-12-27 17:57 ` [PATCH v2 5/5] iommu/vt-d: Use WRITE_ONCE for setting root table entries Dmytro Maluka
@ 2026-01-05 18:12 ` Jason Gunthorpe
  2026-01-05 18:54   ` Dmytro Maluka
  2026-01-06  3:37   ` Baolu Lu
  5 siblings, 2 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2026-01-05 18:12 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
> As discussed in [1], we don't currently prevent the compiler from
> reordering memory writes when updating context entries, which is
> potentially dangerous, as it may cause setting the present bit (i.e.
> enabling DMA translation for the given device) before finishing setting
> up other bits in the context entry (and thus creating a time window when
> a DMA from the device may result in an unpredicted behavior).
> 
> Fix this in the same way as how this is already addressed for PASID
> entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
> setting individual bits in context entries, so that memory writes done
> by those helpers are ordered in relation to each other (plus, prevent
> load/store tearing and so on).
> 
> While at it, similarly paranoidally fix updating root entries as well:
> use WRITE_ONCE to make sure that the present bit is set atomically
> together with the context table address bits, not before them.

The PASID entries should not be manipulated 'livel' in a haphazard way
like this in the first place!

Like AMD and ARM build the new PASID entry on the stack and then it
should be copied to the DMA'able memory in a way that is consistent
with the HW's atomicity granual, paying attention not to 'tear' it.

This manipulate-in-place is just asking for trouble, and can never
support replace or full viommu requirements.. :\

So while it is perhaps an improvement to do this work, it would be
better to fix the root cause issue if someone has time..

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-05 18:12 ` [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Jason Gunthorpe
@ 2026-01-05 18:54   ` Dmytro Maluka
  2026-01-05 19:14     ` Jason Gunthorpe
  2026-01-06  3:37   ` Baolu Lu
  1 sibling, 1 reply; 23+ messages in thread
From: Dmytro Maluka @ 2026-01-05 18:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Mon, Jan 05, 2026 at 02:12:00PM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
> > As discussed in [1], we don't currently prevent the compiler from
> > reordering memory writes when updating context entries, which is
> > potentially dangerous, as it may cause setting the present bit (i.e.
> > enabling DMA translation for the given device) before finishing setting
> > up other bits in the context entry (and thus creating a time window when
> > a DMA from the device may result in an unpredicted behavior).
> > 
> > Fix this in the same way as how this is already addressed for PASID
> > entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
> > setting individual bits in context entries, so that memory writes done
> > by those helpers are ordered in relation to each other (plus, prevent
> > load/store tearing and so on).
> > 
> > While at it, similarly paranoidally fix updating root entries as well:
> > use WRITE_ONCE to make sure that the present bit is set atomically
> > together with the context table address bits, not before them.
> 
> The PASID entries should not be manipulated 'livel' in a haphazard way
> like this in the first place!
> 
> Like AMD and ARM build the new PASID entry on the stack and then it
> should be copied to the DMA'able memory in a way that is consistent
> with the HW's atomicity granual, paying attention not to 'tear' it.

As I understand, the "consistent with the HW's atomicity granual, paying
attention not to 'tear' it" part is already fulfilled for PASID entries
(and with this series, for context entries as well):

static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
{
	u64 old;

	old = READ_ONCE(*ptr);
	WRITE_ONCE(*ptr, (old & ~mask) | bits);
}

I've been assuming it's ok to manipulate other bits in place as long as
we take care to only do that while the present bit it cleared (i.e.
while the entry is ignored by hardware)?

So IIUC the only problem with this approach is the redundancy: we do
this READ_ONCE+WRITE_ONCE for each invididual field in a PASID entry.

So while I agree it would be more more natural to build whole entries,
and the existing way looks strange and not the most efficient, I'm
wondering if it is causing any actual correctness issues (apart from
those addressed by this series).

> This manipulate-in-place is just asking for trouble, and can never
> support replace or full viommu requirements.. :\
> 
> So while it is perhaps an improvement to do this work, it would be
> better to fix the root cause issue if someone has time..
> 
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-05 18:54   ` Dmytro Maluka
@ 2026-01-05 19:14     ` Jason Gunthorpe
  2026-01-05 20:05       ` Dmytro Maluka
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2026-01-05 19:14 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Mon, Jan 05, 2026 at 07:54:53PM +0100, Dmytro Maluka wrote:
> > Like AMD and ARM build the new PASID entry on the stack and then it
> > should be copied to the DMA'able memory in a way that is consistent
> > with the HW's atomicity granual, paying attention not to 'tear' it.
> 
> As I understand, the "consistent with the HW's atomicity granual, paying
> attention not to 'tear' it" part is already fulfilled for PASID entries
> (and with this series, for context entries as well):
> 
> static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
> {
> 	u64 old;
> 
> 	old = READ_ONCE(*ptr);
> 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
> }
> 
> I've been assuming it's ok to manipulate other bits in place as long as
> we take care to only do that while the present bit it cleared (i.e.
> while the entry is ignored by hardware)?

If these are only done while non-present then the only issue is
missing a barrier before setting present, that should be a one line
patch, no?

> So IIUC the only problem with this approach is the redundancy: we do
> this READ_ONCE+WRITE_ONCE for each invididual field in a PASID entry.

You don't need READ_ONCE if there isn't another thread concurrently
writing, and WRITE_ONCE is pointless if the HW is promising not to
read it due to non-present.

> So while I agree it would be more more natural to build whole entries,
> and the existing way looks strange and not the most efficient, I'm
> wondering if it is causing any actual correctness issues (apart from
> those addressed by this series).

It prevents doing the replace operation, which is a correctness issue
for VMs.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-05 19:14     ` Jason Gunthorpe
@ 2026-01-05 20:05       ` Dmytro Maluka
  2026-01-06  0:14         ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Dmytro Maluka @ 2026-01-05 20:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Mon, Jan 05, 2026 at 03:14:10PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 05, 2026 at 07:54:53PM +0100, Dmytro Maluka wrote:
> > > Like AMD and ARM build the new PASID entry on the stack and then it
> > > should be copied to the DMA'able memory in a way that is consistent
> > > with the HW's atomicity granual, paying attention not to 'tear' it.
> > 
> > As I understand, the "consistent with the HW's atomicity granual, paying
> > attention not to 'tear' it" part is already fulfilled for PASID entries
> > (and with this series, for context entries as well):
> > 
> > static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
> > {
> > 	u64 old;
> > 
> > 	old = READ_ONCE(*ptr);
> > 	WRITE_ONCE(*ptr, (old & ~mask) | bits);
> > }
> > 
> > I've been assuming it's ok to manipulate other bits in place as long as
> > we take care to only do that while the present bit it cleared (i.e.
> > while the entry is ignored by hardware)?
> 
> If these are only done while non-present then the only issue is
> missing a barrier before setting present, that should be a one line
> patch, no?

Yes, a barrier, or alternatively WRITE_ONCE for any updates of entries.
The latter is how it was already done for PASID entries, before this
series (so IIUC for PASID entries there was already no issue), and my
main goal was to fix the issue for context entries as well, so I just
did it similarly.

> > So IIUC the only problem with this approach is the redundancy: we do
> > this READ_ONCE+WRITE_ONCE for each invididual field in a PASID entry.
> 
> You don't need READ_ONCE if there isn't another thread concurrently
> writing,

Good point. But it was there before this series.

> and WRITE_ONCE is pointless if the HW is promising not to
> read it due to non-present.

As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
any updates (for PASID entries) is what was already done before this
series.

Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
prevent the compiler from doing strange but legal things that involve
transiently setting and then clearing the present bit behind our back?
(Not that I'm aware of any compilers doing that, but I'm no compiler
expert.)

> > So while I agree it would be more more natural to build whole entries,
> > and the existing way looks strange and not the most efficient, I'm
> > wondering if it is causing any actual correctness issues (apart from
> > those addressed by this series).
> 
> It prevents doing the replace operation, which is a correctness issue
> for VMs.
> 
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-05 20:05       ` Dmytro Maluka
@ 2026-01-06  0:14         ` Jason Gunthorpe
  2026-01-06  7:48           ` Tian, Kevin
  2026-01-06 13:51           ` Dmytro Maluka
  0 siblings, 2 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2026-01-06  0:14 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > and WRITE_ONCE is pointless if the HW is promising not to
> > read it due to non-present.
> 
> As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> any updates (for PASID entries) is what was already done before this
> series.

That is an x86 ism and it shouldn't be needlessly leaked into drivers.
As this is not performance it should have the portable flow:

   WRITE_ONCE(non-present)
   dma_wmb()
   <cmd to flush caches>

   [..]
   <writes to the entry>

   dma_wmb()
   WRITE_ONCE(present)

So, it seems to me like all you need here is the one line to add the
dma_wmb() prior to setting present.

> Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
> prevent the compiler from doing strange but legal things that involve
> transiently setting and then clearing the present bit behind our back?
> (Not that I'm aware of any compilers doing that, but I'm no compiler
> expert.)

You do have to write once the present bit, but not the others. The
dma_wmb() will make sure the HW cannot observe other states when it
observes present.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-05 18:12 ` [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Jason Gunthorpe
  2026-01-05 18:54   ` Dmytro Maluka
@ 2026-01-06  3:37   ` Baolu Lu
  1 sibling, 0 replies; 23+ messages in thread
From: Baolu Lu @ 2026-01-06  3:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Dmytro Maluka
  Cc: David Woodhouse, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	linux-kernel, Vineeth Pillai (Google), Aashish Sharma,
	Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On 1/6/26 02:12, Jason Gunthorpe wrote:
> On Sat, Dec 27, 2025 at 06:57:23PM +0100, Dmytro Maluka wrote:
>> As discussed in [1], we don't currently prevent the compiler from
>> reordering memory writes when updating context entries, which is
>> potentially dangerous, as it may cause setting the present bit (i.e.
>> enabling DMA translation for the given device) before finishing setting
>> up other bits in the context entry (and thus creating a time window when
>> a DMA from the device may result in an unpredicted behavior).
>>
>> Fix this in the same way as how this is already addressed for PASID
>> entries, i.e. by using READ_ONCE/WRITE_ONCE in the helpers used for
>> setting individual bits in context entries, so that memory writes done
>> by those helpers are ordered in relation to each other (plus, prevent
>> load/store tearing and so on).
>>
>> While at it, similarly paranoidally fix updating root entries as well:
>> use WRITE_ONCE to make sure that the present bit is set atomically
>> together with the context table address bits, not before them.
> The PASID entries should not be manipulated 'livel' in a haphazard way
> like this in the first place!
> 
> Like AMD and ARM build the new PASID entry on the stack and then it
> should be copied to the DMA'able memory in a way that is consistent
> with the HW's atomicity granual, paying attention not to 'tear' it.
> 
> This manipulate-in-place is just asking for trouble, and can never
> support replace or full viommu requirements.. :\
> 
> So while it is perhaps an improvement to do this work, it would be
> better to fix the root cause issue if someone has time..

Agreed. The current Intel IOMMU driver uses a 'clear-populate-set'
pattern protected by a spinlock, which is why it doesn't support
'replace' yet. Dmytro's patch addresses the immediate risk of the
compiler reordering those writes and exposing invalid data to the
hardware.

Moving to an on-stack construction (like AMD/ARM) and updating
atomically is the right direction for the driver. We'll look into that
refactoring as a follow-up series to modernize the entry manipulation
logic.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06  0:14         ` Jason Gunthorpe
@ 2026-01-06  7:48           ` Tian, Kevin
  2026-01-06 14:40             ` Dmytro Maluka
  2026-01-06 13:51           ` Dmytro Maluka
  1 sibling, 1 reply; 23+ messages in thread
From: Tian, Kevin @ 2026-01-06  7:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Dmytro Maluka
  Cc: David Woodhouse, Lu Baolu, iommu@lists.linux.dev, Joerg Roedel,
	Will Deacon, Robin Murphy, linux-kernel@vger.kernel.org,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Dong, Chuanxiao

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, January 6, 2026 8:14 AM
> 
> On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > > and WRITE_ONCE is pointless if the HW is promising not to
> > > read it due to non-present.
> >
> > As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> > for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> > any updates (for PASID entries) is what was already done before this
> > series.
> 
> That is an x86 ism and it shouldn't be needlessly leaked into drivers.

yeah WRITE_ONCE() is not by definition to guarantee the ordering between
CPU and device.

lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
as 1) between CPUs there is already lock protection; 2) between CPU and
device it requires dma_wmb() to guarantee the order.

> As this is not performance it should have the portable flow:
> 
>    WRITE_ONCE(non-present)
>    dma_wmb()
>    <cmd to flush caches>
> 
>    [..]
>    <writes to the entry>
> 
>    dma_wmb()
>    WRITE_ONCE(present)
> 
> So, it seems to me like all you need here is the one line to add the
> dma_wmb() prior to setting present.

+1

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06  0:14         ` Jason Gunthorpe
  2026-01-06  7:48           ` Tian, Kevin
@ 2026-01-06 13:51           ` Dmytro Maluka
  2026-01-06 14:23             ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Dmytro Maluka @ 2026-01-06 13:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Mon, Jan 05, 2026 at 08:14:18PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 05, 2026 at 09:05:36PM +0100, Dmytro Maluka wrote:
> > > and WRITE_ONCE is pointless if the HW is promising not to
> > > read it due to non-present.
> > 
> > As long as we use a barrier. And IIUC vice versa, if we use WRITE_ONCE
> > for any updates, a barrier is not necessary (on x86). And WRITE_ONCE for
> > any updates (for PASID entries) is what was already done before this
> > series.
> 
> That is an x86 ism and it shouldn't be needlessly leaked into drivers.
> As this is not performance it should have the portable flow:

Agree. And as a matter of fact, patch 4 in this series already does that
- it adds a barrier before setting present bit, despite it being
"redundant on x86".

It adds smp_wmb() as suggested by Baolu but I can change it to dma_wmb()
which indeed seems more suitable. (FWIW on x86 there is no difference
between smp_wmb() and dma_wmb(), both are just barrier().)

>    WRITE_ONCE(non-present)
>    dma_wmb()
>    <cmd to flush caches>

Regarding a barrier after clearing present bit - good point, I should
probably add that to my patch 4 as well.

Regarding flushing caches right after that - what for? (BTW the Intel
driver doesn't do that either.) If we don't do that and as a result the
HW is using an old entry cached before we cleared the present bit, it
is not affected by our later modifications anyway.

Of course we flush caches at the end, after setting the present bit, but
that's a different story.

> > Although, perhaps even with a barrier, WRITE_ONCE is still desirable to
> > prevent the compiler from doing strange but legal things that involve
> > transiently setting and then clearing the present bit behind our back?
> > (Not that I'm aware of any compilers doing that, but I'm no compiler
> > expert.)
> 
> You do have to write once the present bit, but not the others. The
> dma_wmb() will make sure the HW cannot observe other states when it
> observes present.

I was talking about compiler guarantees, not HW guarantees. I mean: when
setting some other bit in the entry before the barrier, if we do that
without WRITE_ONCE, with a mere "foo |= bar", are we certain the
compiler will not implement that as, for example, setting the value to
0xffffffffffffffff and then clearing other bits (for whatever crazy
reason)? That would be still a legal thing for the compiler to do, in
terms of its single-thread guarantees?

So it still seems a good idea to use WRITE_ONCE (or anything not weaker
than WRITE_ONCE) for any updates of memory structures used by hardware,
to prevent such things? (And using WRITE_ONCE for that doesn't really
cost us anything anyway.)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06 13:51           ` Dmytro Maluka
@ 2026-01-06 14:23             ` Jason Gunthorpe
  2026-01-06 15:50               ` Dmytro Maluka
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2026-01-06 14:23 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> >    WRITE_ONCE(non-present)
> >    dma_wmb()
> >    <cmd to flush caches>
> 
> Regarding a barrier after clearing present bit - good point, I should
> probably add that to my patch 4 as well.

It is already integrated into <cmd to flush caches> I showed it here
for clarity.

> Regarding flushing caches right after that - what for? (BTW the Intel
> driver doesn't do that either.) If we don't do that and as a result the
> HW is using an old entry cached before we cleared the present bit, it
> is not affected by our later modifications anyway.

You don't know what state the HW fetcher is in. This kind of race is possible:

     CPU                 FETCHER
                        read present = 1
    present = 0
    mangle qword 1
                        read qword 1
                        < fail - HW sees a corrupted entry >

The flush is not just a flush but a barrier to synchronize with the HW
that it is done all fetches that may have been dependent on seeing
present = 1.

So missing a flush after clearing present is possibly a bug today - I
don't remember what guarenteed the atomic size is for Intel IOMMU
though, if the atomic size is the whole entry it is OK since there is
only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.

> I was talking about compiler guarantees, not HW guarantees. I mean: when
> setting some other bit in the entry before the barrier, if we do that
> without WRITE_ONCE, with a mere "foo |= bar", are we certain the
> compiler will not implement that as, for example, setting the value to
> 0xffffffffffffffff and then clearing other bits (for whatever crazy
> reason)? That would be still a legal thing for the compiler to do, in
> terms of its single-thread guarantees?

The HW doesn't read the values the CPU is writing, so it doesn't
matter if the compiler does something strange. That is the whole
justification for why it is possible to code it like this at all.

The dma_mb() is also a compiler barrier and ensures all that
uncertainty is resolved. Once it completes a DMA from the HW will see
the program defined values only.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06  7:48           ` Tian, Kevin
@ 2026-01-06 14:40             ` Dmytro Maluka
  2026-01-08  2:22               ` Tian, Kevin
  0 siblings, 1 reply; 23+ messages in thread
From: Dmytro Maluka @ 2026-01-06 14:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, David Woodhouse, Lu Baolu, iommu@lists.linux.dev,
	Joerg Roedel, Will Deacon, Robin Murphy,
	linux-kernel@vger.kernel.org, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Dong, Chuanxiao

On Tue, Jan 06, 2026 at 07:48:50AM +0000, Tian, Kevin wrote:
> yeah WRITE_ONCE() is not by definition to guarantee the ordering between
> CPU and device.

Yes, WRITE_ONCE is not about HW guarantess at all, it is about compiler
guarantess. And it is not about ordering, it is about compiler's
guarantee to store the given 64-bit value once, with one instruction.
But this compiler guarantee is exactly my point (see my last reply to
Jason).

> lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
> as 1) between CPUs there is already lock protection; 2) between CPU and
> device it requires dma_wmb() to guarantee the order.

As I see it, those WRITE_ONCEs (maybe not READ_ONCEs) haven't been
meaningless (I mean, they have been actually useful) so long as we
haven't been using any barriers. Again, on x86, store ordering requires
just compiler ordering, and dma_wmb() is just a compiler barrier. So,
assuming this driver is only used on x86 (which is, well, true :)),
we are lacking even compiler barriers, but at least we have those
WRITE_ONCEs, which provide compiler ordering too (although only between
each other, not with any other memory accesses, but that seems enough
for our case).

And again, I agree it is not pretty to rely on arch-specific ordering
assumptions, and doing in-place updates via those context_xxx() and
pasid_xxx() helpers all over the place instead of updating whole entries
seems a strange choice. But that is how it was implemented 10 or so
years ago, and overhauling that hasn't been my goal.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06 14:23             ` Jason Gunthorpe
@ 2026-01-06 15:50               ` Dmytro Maluka
  2026-01-06 16:45                 ` Jason Gunthorpe
                                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Dmytro Maluka @ 2026-01-06 15:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > Regarding flushing caches right after that - what for? (BTW the Intel
> > driver doesn't do that either.) If we don't do that and as a result the
> > HW is using an old entry cached before we cleared the present bit, it
> > is not affected by our later modifications anyway.
> 
> You don't know what state the HW fetcher is in. This kind of race is possible:
> 
>      CPU                 FETCHER
>                         read present = 1
>     present = 0
>     mangle qword 1
>                         read qword 1
>                         < fail - HW sees a corrupted entry >
> 
> The flush is not just a flush but a barrier to synchronize with the HW
> that it is done all fetches that may have been dependent on seeing
> present = 1.
> 
> So missing a flush after clearing present is possibly a bug today - I
> don't remember what guarenteed the atomic size is for Intel IOMMU
> though, if the atomic size is the whole entry it is OK since there is
> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.

Indeed, may be a bug... In the VT-d spec I don't immediately see a
guarantee that context and PASID entries are fetched atomically. (And
for PASID entries, which are 512 bits, that seems particularly
unlikely.)

> > I was talking about compiler guarantees, not HW guarantees. I mean: when
> > setting some other bit in the entry before the barrier, if we do that
> > without WRITE_ONCE, with a mere "foo |= bar", are we certain the
> > compiler will not implement that as, for example, setting the value to
> > 0xffffffffffffffff and then clearing other bits (for whatever crazy
> > reason)? That would be still a legal thing for the compiler to do, in
> > terms of its single-thread guarantees?
> 
> The HW doesn't read the values the CPU is writing, so it doesn't
> matter if the compiler does something strange. That is the whole
> justification for why it is possible to code it like this at all.
> 
> The dma_mb() is also a compiler barrier and ensures all that
> uncertainty is resolved. Once it completes a DMA from the HW will see
> the program defined values only.

The HW (the IOMMU) may read context / PASID / page table entries from
memory at any time, whenever a device makes a DMA request so the IOMMU
needs to translate it? We don't know if it happens before or after the
barrier?

So we'd better make sure that if it happens before the barrier (i.e.
when the device is not supposed to do DMA), the compiler (and thus
the CPU) doesn't set the present bit, so it stays non-present, so
the IOMMU will block this unexpected/malicious DMA?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06 15:50               ` Dmytro Maluka
@ 2026-01-06 16:45                 ` Jason Gunthorpe
  2026-01-06 17:14                   ` Dmytro Maluka
  2026-01-08  2:09                 ` Tian, Kevin
       [not found]                 ` <BN9PR11MB5276FB0F465DBFB4EE4D742B8C85A@BN9PR11MB5276.namprd11.prod.outlook.com>
  2 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2026-01-06 16:45 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Tue, Jan 06, 2026 at 04:50:11PM +0100, Dmytro Maluka wrote:
> So we'd better make sure that if it happens before the barrier (i.e.
> when the device is not supposed to do DMA), the compiler (and thus
> the CPU) doesn't set the present bit, so it stays non-present, so
> the IOMMU will block this unexpected/malicious DMA?

It is true that any write to the dword containing the present bit
(only!) should probably use WRITE_ONCE.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06 16:45                 ` Jason Gunthorpe
@ 2026-01-06 17:14                   ` Dmytro Maluka
  0 siblings, 0 replies; 23+ messages in thread
From: Dmytro Maluka @ 2026-01-06 17:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Chuanxiao Dong, Kevin Tian

On Tue, Jan 06, 2026 at 12:45:18PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 06, 2026 at 04:50:11PM +0100, Dmytro Maluka wrote:
> > So we'd better make sure that if it happens before the barrier (i.e.
> > when the device is not supposed to do DMA), the compiler (and thus
> > the CPU) doesn't set the present bit, so it stays non-present, so
> > the IOMMU will block this unexpected/malicious DMA?
> 
> It is true that any write to the dword containing the present bit
> (only!) should probably use WRITE_ONCE.

Yeah, I agree that for other dwords within an entry it isn't necessary.

So, when it comes to the actual code, it's not like the Intel IOMMU
driver has lots of WRITE_ONCE spread all over the code. It's only in
pasid_set_bits() and pasid_clear_entry(). Though, since pasid_set_bits()
is used for setting any bits, as a result, WRITE_ONCE is also done for
dwords that don't need it, but that doesn't hurt either.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06 15:50               ` Dmytro Maluka
  2026-01-06 16:45                 ` Jason Gunthorpe
@ 2026-01-08  2:09                 ` Tian, Kevin
  2026-01-09  6:32                   ` Baolu Lu
       [not found]                 ` <BN9PR11MB5276FB0F465DBFB4EE4D742B8C85A@BN9PR11MB5276.namprd11.prod.outlook.com>
  2 siblings, 1 reply; 23+ messages in thread
From: Tian, Kevin @ 2026-01-08  2:09 UTC (permalink / raw)
  To: Dmytro Maluka, Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu@lists.linux.dev, Joerg Roedel,
	Will Deacon, Robin Murphy, linux-kernel@vger.kernel.org,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Dong, Chuanxiao

> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 11:50 PM
> 
> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > > Regarding flushing caches right after that - what for? (BTW the Intel
> > > driver doesn't do that either.) If we don't do that and as a result the
> > > HW is using an old entry cached before we cleared the present bit, it
> > > is not affected by our later modifications anyway.
> >
> > You don't know what state the HW fetcher is in. This kind of race is possible:
> >
> >      CPU                 FETCHER
> >                         read present = 1
> >     present = 0
> >     mangle qword 1
> >                         read qword 1
> >                         < fail - HW sees a corrupted entry >
> >
> > The flush is not just a flush but a barrier to synchronize with the HW
> > that it is done all fetches that may have been dependent on seeing
> > present = 1.
> >
> > So missing a flush after clearing present is possibly a bug today - I
> > don't remember what guarenteed the atomic size is for Intel IOMMU
> > though, if the atomic size is the whole entry it is OK since there is
> > only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
> 
> Indeed, may be a bug... In the VT-d spec I don't immediately see a
> guarantee that context and PASID entries are fetched atomically. (And
> for PASID entries, which are 512 bits, that seems particularly
> unlikely.)
> 

512bits atomicity is possible, but not on the PASID entry.

VT-d spec, head of section 9 (Translation Structure Formats):

"
This chapter describes the memory-resident structures for DMA and
interrupt remapping. Hardware must access structure entries that
are 64-bit or 128-bit atomically. Hardware must update a 512-bit
Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
Other than the Posted Interrupt Descriptor (PID), hardware is allowed 
to break access to larger than 128-bit entries into multiple aligned
128-bit accesses.
"

root entry, scalable root entry, context entry and IRTE are 128bits
so they are OK.

scalable context entry are 256bits but only the lower 128bits are
defined so it's OK for now.

scalable PASID directory entry is 64bits. ok.

posted interrupt descriptor is 512bits with atomicity guaranteed.

but we do have problem on scalable pasid entry which is 512bits.
  - bits beyond 191 are for future hardware, not a problem now
  - bits 128-191 are for 1st-stage
  - bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage

so in theory 1st-stage and nesting are affected by this bug.

In reality:
  - iommu driver shouldn't receive an attach request on an in-use pasid
    entry, so the cache should remain cleared (either at initial state or
    flushed by previous teardown) then hw won't use a partial 1st-stage
    config after seeing the entry as non-present.

  - replace is already broken, as the entry should not be cleared in the
    1st place then this bug will be fixed when replace is reworked.

If no oversight (Baolu?), probably we don't need to fix it strictly following
Jason's pseudo logic at this point. Instead, just rename pasid_clear_entry()
to pasid_clear_entry_no_flush() for now (with some comment to clarify
the expectation), and rework the replace path in parallel.

We may never require a pasid_clear_entry_flush_cache() once hitless
replace is in place. 😊

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-06 14:40             ` Dmytro Maluka
@ 2026-01-08  2:22               ` Tian, Kevin
  0 siblings, 0 replies; 23+ messages in thread
From: Tian, Kevin @ 2026-01-08  2:22 UTC (permalink / raw)
  To: Dmytro Maluka
  Cc: Jason Gunthorpe, David Woodhouse, Lu Baolu, iommu@lists.linux.dev,
	Joerg Roedel, Will Deacon, Robin Murphy,
	linux-kernel@vger.kernel.org, Vineeth Pillai (Google),
	Aashish Sharma, Grzegorz Jaszczyk, Dong, Chuanxiao

> From: Dmytro Maluka <dmaluka@chromium.org>
> Sent: Tuesday, January 6, 2026 10:40 PM
> 
> On Tue, Jan 06, 2026 at 07:48:50AM +0000, Tian, Kevin wrote:
> > yeah WRITE_ONCE() is not by definition to guarantee the ordering between
> > CPU and device.
> 
> Yes, WRITE_ONCE is not about HW guarantess at all, it is about compiler
> guarantess. And it is not about ordering, it is about compiler's
> guarantee to store the given 64-bit value once, with one instruction.
> But this compiler guarantee is exactly my point (see my last reply to
> Jason).
> 
> > lots of READ_ONCE()/WRITE_ONCE() in existing code are meaningless,
> > as 1) between CPUs there is already lock protection; 2) between CPU and
> > device it requires dma_wmb() to guarantee the order.
> 
> As I see it, those WRITE_ONCEs (maybe not READ_ONCEs) haven't been
> meaningless (I mean, they have been actually useful) so long as we
> haven't been using any barriers. Again, on x86, store ordering requires
> just compiler ordering, and dma_wmb() is just a compiler barrier. So,
> assuming this driver is only used on x86 (which is, well, true :)),
> we are lacking even compiler barriers, but at least we have those
> WRITE_ONCEs, which provide compiler ordering too (although only between
> each other, not with any other memory accesses, but that seems enough
> for our case).
> 
> And again, I agree it is not pretty to rely on arch-specific ordering
> assumptions, and doing in-place updates via those context_xxx() and
> pasid_xxx() helpers all over the place instead of updating whole entries
> seems a strange choice. But that is how it was implemented 10 or so
> years ago, and overhauling that hasn't been my goal.

sure. the point here is to align on what is the right thing to do. then
we could have short-term fixes plus bigger refactoring later.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
       [not found]                 ` <BN9PR11MB5276FB0F465DBFB4EE4D742B8C85A@BN9PR11MB5276.namprd11.prod.outlook.com>
@ 2026-01-08  7:00                   ` Tian, Kevin
  0 siblings, 0 replies; 23+ messages in thread
From: Tian, Kevin @ 2026-01-08  7:00 UTC (permalink / raw)
  To: Dmytro Maluka, Jason Gunthorpe
  Cc: David Woodhouse, Lu Baolu, iommu@lists.linux.dev, Joerg Roedel,
	Will Deacon, Robin Murphy, linux-kernel@vger.kernel.org,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Dong, Chuanxiao

> From: Tian, Kevin
> Sent: Thursday, January 8, 2026 10:10 AM
> 
> > From: Dmytro Maluka <dmaluka@chromium.org>
> > Sent: Tuesday, January 6, 2026 11:50 PM
> >
> > On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
> > > > Regarding flushing caches right after that - what for? (BTW the Intel
> > > > driver doesn't do that either.) If we don't do that and as a result the
> > > > HW is using an old entry cached before we cleared the present bit, it
> > > > is not affected by our later modifications anyway.
> > >
> > > You don't know what state the HW fetcher is in. This kind of race is
> possible:
> > >
> > >      CPU                 FETCHER
> > >                         read present = 1
> > >     present = 0
> > >     mangle qword 1
> > >                         read qword 1
> > >                         < fail - HW sees a corrupted entry >
> > >
> > > The flush is not just a flush but a barrier to synchronize with the HW
> > > that it is done all fetches that may have been dependent on seeing
> > > present = 1.
> > >
> > > So missing a flush after clearing present is possibly a bug today - I
> > > don't remember what guarenteed the atomic size is for Intel IOMMU
> > > though, if the atomic size is the whole entry it is OK since there is
> > > only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
> >
> > Indeed, may be a bug... In the VT-d spec I don't immediately see a
> > guarantee that context and PASID entries are fetched atomically. (And
> > for PASID entries, which are 512 bits, that seems particularly
> > unlikely.)
> >
> 
> 512bits atomicity is possible, but not on the PASID entry.
> 
> VT-d spec, head of section 9 (Translation Structure Formats):
> 
> "
> This chapter describes the memory-resident structures for DMA and
> interrupt remapping. Hardware must access structure entries that
> are 64-bit or 128-bit atomically. Hardware must update a 512-bit
> Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
> Other than the Posted Interrupt Descriptor (PID), hardware is allowed
> to break access to larger than 128-bit entries into multiple aligned
> 128-bit accesses.
> "
> 
> root entry, scalable root entry, context entry and IRTE are 128bits
> so they are OK.
> 
> scalable context entry are 256bits but only the lower 128bits are
> defined so it's OK for now.
> 
> scalable PASID directory entry is 64bits. ok.
> 
> posted interrupt descriptor is 512bits with atomicity guaranteed.
> 
> but we do have problem on scalable pasid entry which is 512bits.
>   - bits beyond 191 are for future hardware, not a problem now
>   - bits 128-191 are for 1st-stage
>   - bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
> 
> so in theory 1st-stage and nesting are affected by this bug.
> 
> In reality:
>   - iommu driver shouldn't receive an attach request on an in-use pasid
>     entry, so the cache should remain cleared (either at initial state or
>     flushed by previous teardown) then hw won't use a partial 1st-stage
>     config after seeing the entry as non-present.
> 
>   - replace is already broken, as the entry should not be cleared in the
>     1st place then this bug will be fixed when replace is reworked.
> 
> If no oversight (Baolu?), probably we don't need to fix it strictly following
> Jason's pseudo logic at this point. Instead, just rename pasid_clear_entry()
> to pasid_clear_entry_no_flush() for now (with some comment to clarify
> the expectation), and rework the replace path in parallel.
> 

ah we do have a problem with teardown though:

     CPU                 FETCHER
                        read 1st 128bit (present = 1, use 1st-stage)
    clear 1st u64 (present = 0)
    clear 2nd u64
    clear 3rd u64 (FSPTPTR=0)
                        read 2nd 128bit (use addr0 as FSPTPTR)
                        < random memory corruption >
    clear remaining u64's
    flush cache

exactly as Jason's example shows. :/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates
  2026-01-08  2:09                 ` Tian, Kevin
@ 2026-01-09  6:32                   ` Baolu Lu
  0 siblings, 0 replies; 23+ messages in thread
From: Baolu Lu @ 2026-01-09  6:32 UTC (permalink / raw)
  To: Tian, Kevin, Dmytro Maluka, Jason Gunthorpe
  Cc: David Woodhouse, iommu@lists.linux.dev, Joerg Roedel, Will Deacon,
	Robin Murphy, linux-kernel@vger.kernel.org,
	Vineeth Pillai (Google), Aashish Sharma, Grzegorz Jaszczyk,
	Dong, Chuanxiao

On 1/8/26 10:09, Tian, Kevin wrote:
>> From: Dmytro Maluka <dmaluka@chromium.org>
>> Sent: Tuesday, January 6, 2026 11:50 PM
>>
>> On Tue, Jan 06, 2026 at 10:23:01AM -0400, Jason Gunthorpe wrote:
>>> On Tue, Jan 06, 2026 at 02:51:38PM +0100, Dmytro Maluka wrote:
>>>> Regarding flushing caches right after that - what for? (BTW the Intel
>>>> driver doesn't do that either.) If we don't do that and as a result the
>>>> HW is using an old entry cached before we cleared the present bit, it
>>>> is not affected by our later modifications anyway.
>>>
>>> You don't know what state the HW fetcher is in. This kind of race is possible:
>>>
>>>       CPU                 FETCHER
>>>                          read present = 1
>>>      present = 0
>>>      mangle qword 1
>>>                          read qword 1
>>>                          < fail - HW sees a corrupted entry >
>>>
>>> The flush is not just a flush but a barrier to synchronize with the HW
>>> that it is done all fetches that may have been dependent on seeing
>>> present = 1.
>>>
>>> So missing a flush after clearing present is possibly a bug today - I
>>> don't remember what guarenteed the atomic size is for Intel IOMMU
>>> though, if the atomic size is the whole entry it is OK since there is
>>> only one fetcher read. Though AMD is 128 bits and ARM is 64 bits.
>>
>> Indeed, may be a bug... In the VT-d spec I don't immediately see a
>> guarantee that context and PASID entries are fetched atomically. (And
>> for PASID entries, which are 512 bits, that seems particularly
>> unlikely.)
>>
> 
> 512bits atomicity is possible, but not on the PASID entry.
> 
> VT-d spec, head of section 9 (Translation Structure Formats):
> 
> "
> This chapter describes the memory-resident structures for DMA and
> interrupt remapping. Hardware must access structure entries that
> are 64-bit or 128-bit atomically. Hardware must update a 512-bit
> Posted Interrupt Descriptor (see Section 9.11 for details) atomically.
> Other than the Posted Interrupt Descriptor (PID), hardware is allowed
> to break access to larger than 128-bit entries into multiple aligned
> 128-bit accesses.
> "
> 
> root entry, scalable root entry, context entry and IRTE are 128bits
> so they are OK.
> 
> scalable context entry are 256bits but only the lower 128bits are
> defined so it's OK for now.
> 
> scalable PASID directory entry is 64bits. ok.
> 
> posted interrupt descriptor is 512bits with atomicity guaranteed.
> 
> but we do have problem on scalable pasid entry which is 512bits.
>    - bits beyond 191 are for future hardware, not a problem now
>    - bits 128-191 are for 1st-stage
>    - bits 0-127 manages stage selection, 2nd-stage, and some 1st-stage
> 
> so in theory 1st-stage and nesting are affected by this bug.

Yes. This is a real software bug. The hardware is legally allowed to
tear the pasid table entry read into 4 128-bit chunks. For first-stage
(first-only or nested) translation, chunk 1 (bit 0-127) and chunk 2 (bit
128-191) both contain active configuration data, hardware could possibly
read a entry composed of half-old and half-new data.

The VT-d spec defines software guide for pasid table entry manipulation
in section 6.5.3.3 (Guidance to Software for Invalidations), I think the
Linux driver doesn't handle the handshake between CPU and IOMMU hardware
in the right way.

The correct way should be a clear-flush-update sequence, that is, when
tearing down a present pasid table entry, the software should

1. Clear the Present bit;
2. Invalidate the cache according to section 6.5.3.3;
[now CPU owns the pasid table entry]
3. Update other fields;
4. Set the Present bit;
[now the VT-d hardware owns the pasid table entry].

> 
> In reality:
>    - iommu driver shouldn't receive an attach request on an in-use pasid
>      entry, so the cache should remain cleared (either at initial state or
>      flushed by previous teardown) then hw won't use a partial 1st-stage
>      config after seeing the entry as non-present.

Yes. But the current tear down process is buggy as described above.

> 
>    - replace is already broken, as the entry should not be cleared in the
>      1st place then this bug will be fixed when replace is reworked.

We still have a long way to go before achieving the real replace since
the hardware doesn't guarantee 512-bit atomicity, "hitless" is very
difficult.

> 
> If no oversight (Baolu?), probably we don't need to fix it strictly following
> Jason's pseudo logic at this point. Instead, just rename pasid_clear_entry()
> to pasid_clear_entry_no_flush() for now (with some comment to clarify
> the expectation), and rework the replace path in parallel.
> We may never require a pasid_clear_entry_flush_cache() once hitless> replace is in place. 😊

Perhaps we can first add pasid_entry_clear_present() to fulfill the
clear-flush-update handshake between the software and the IOMMU hardware
while reworking the PASID replace path?

Thanks,
baolu

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2026-01-09  6:32 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-27 17:57 [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 1/5] iommu/vt-d: Sanitize set bits in pasid_set_bits() Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 2/5] iommu/vt-d: Generalize pasid_set_bits() Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 3/5] iommu/vt-d: Ensure memory ordering in context entry updates Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 4/5] iommu/vt-d: Use smp_wmb() before setting context/pasid present bit Dmytro Maluka
2025-12-27 17:57 ` [PATCH v2 5/5] iommu/vt-d: Use WRITE_ONCE for setting root table entries Dmytro Maluka
2026-01-05 18:12 ` [PATCH v2 0/5] iommu/vt-d: Ensure memory ordering in context & root entry updates Jason Gunthorpe
2026-01-05 18:54   ` Dmytro Maluka
2026-01-05 19:14     ` Jason Gunthorpe
2026-01-05 20:05       ` Dmytro Maluka
2026-01-06  0:14         ` Jason Gunthorpe
2026-01-06  7:48           ` Tian, Kevin
2026-01-06 14:40             ` Dmytro Maluka
2026-01-08  2:22               ` Tian, Kevin
2026-01-06 13:51           ` Dmytro Maluka
2026-01-06 14:23             ` Jason Gunthorpe
2026-01-06 15:50               ` Dmytro Maluka
2026-01-06 16:45                 ` Jason Gunthorpe
2026-01-06 17:14                   ` Dmytro Maluka
2026-01-08  2:09                 ` Tian, Kevin
2026-01-09  6:32                   ` Baolu Lu
     [not found]                 ` <BN9PR11MB5276FB0F465DBFB4EE4D742B8C85A@BN9PR11MB5276.namprd11.prod.outlook.com>
2026-01-08  7:00                   ` Tian, Kevin
2026-01-06  3:37   ` Baolu Lu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.