public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync
@ 2026-03-09  6:06 Lu Baolu
  2026-03-09  6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

This is a follow-up to recent discussions on the iommu community mailing
list [1] [2] regarding potential race conditions in table entry updates.
After addressing atomicity in context and PASID entry updates [3], this
series modernizes the Intel IOMMU driver by introducing a "hitless"
update mechanism.

The core of this series lifts the synchronization logic originally found
in the ARM SMMUv3 driver into a generic IOMMU library (entry_sync) and
plumbs it into the Intel IOMMU driver.

Traditionally, updating a PASID table entry while the hardware is
performing DMA required a disruptive "clear-then-update" sequence. By
analyzing "used bits" and enforcing 128-bit atomicity via CMPXCHG16B,
this library allows the driver to transition between translation modes
hitlessly whenever possible.
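To illustrate the core idea, here is a standalone sketch of the "used
bits" analysis with toy 2-qword entries and invented masks (not the
VT-d entry layout or the kernel API):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_QW 2

/*
 * Return a bitmask of qwords whose *used* bits must change to reach
 * the target. Zero means the update only touches bits the HW currently
 * ignores; a single set bit means one atomic store suffices (hitless);
 * more set bits force a disruptive V=0 sequence.
 */
static uint8_t used_qword_diff(const uint64_t *entry, const uint64_t *target,
			       const uint64_t *cur_used,
			       const uint64_t *tgt_used)
{
	uint8_t diff = 0;

	for (int i = 0; i != NUM_QW; i++) {
		/* Bits the HW currently ignores may change freely. */
		uint64_t unused_update = (entry[i] & cur_used[i]) |
					 (target[i] & ~cur_used[i]);

		if ((unused_update & tgt_used[i]) != target[i])
			diff |= (uint8_t)(1 << i);
	}
	return diff;
}
```

A target that only changes ignored bits yields a diff of 0; one changed
used qword yields a single bit; two changed used qwords require the
breaking path.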

[1] https://lore.kernel.org/linux-iommu/20251227175728.4358-1-dmaluka@chromium.org/
[2] https://lore.kernel.org/linux-iommu/20260107201800.2486137-1-skhawaja@google.com/
[3] https://lore.kernel.org/linux-iommu/20260120061816.2132558-1-baolu.lu@linux.intel.com/

This series is also available on github:
[4] https://github.com/LuBaolu/intel-iommu/commits/pasid-entry-sync-v1

Best regards,
baolu

Jason Gunthorpe (1):
  iommu: Lift and generalize the STE/CD update code from SMMUv3

Lu Baolu (7):
  iommu/vt-d: Add entry_sync support for PASID entry updates
  iommu/vt-d: Require CMPXCHG16B for PASID support
  iommu/vt-d: Add trace events for PASID entry sync updates
  iommu/vt-d: Use intel_pasid_write() for first-stage setup
  iommu/vt-d: Use intel_pasid_write() for second-stage setup
  iommu/vt-d: Use intel_pasid_write() for pass-through setup
  iommu/vt-d: Use intel_pasid_write() for nested setup

 drivers/iommu/Kconfig               |  14 ++
 drivers/iommu/intel/Kconfig         |   4 +-
 drivers/iommu/Makefile              |   1 +
 drivers/iommu/entry_sync.h          |  66 +++++++
 drivers/iommu/entry_sync_template.h | 143 ++++++++++++++
 drivers/iommu/intel/iommu.h         |   8 +-
 drivers/iommu/intel/trace.h         | 107 ++++++++++
 drivers/iommu/entry_sync.c          |  68 +++++++
 drivers/iommu/intel/iommu.c         |  51 ++---
 drivers/iommu/intel/nested.c        |  13 +-
 drivers/iommu/intel/pasid.c         | 291 +++++++++++++++++++---------
 drivers/iommu/intel/svm.c           |   5 +-
 12 files changed, 620 insertions(+), 151 deletions(-)
 create mode 100644 drivers/iommu/entry_sync.h
 create mode 100644 drivers/iommu/entry_sync_template.h
 create mode 100644 drivers/iommu/entry_sync.c

-- 
2.43.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09 23:33   ` Samiullah Khawaja
  2026-03-13  5:39   ` Nicolin Chen
  2026-03-09  6:06 ` [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates Lu Baolu
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

From: Jason Gunthorpe <jgg@nvidia.com>

Many IOMMU implementations store data structures in host memory that can
be quite big. The IOMMU DMA-reads the host memory in atomic quanta,
usually 64 or 128 bits, and reads a full entry using multiple quanta
reads.

Updating a host memory data structure entry while the HW is concurrently
DMA-reading it is somewhat involved, but doing it hitlessly, without
ever making the entry non-valid, is quite complicated.

entry_sync is a library to handle this task. It works on the notion of
"used bits", which reflect which bits the HW is actually sensitive to
and which bits are ignored by hardware. Many hardware specifications say
things like 'if mode is X then bits ABC are ignored'.

Using the ignored bits, entry_sync can often compute a series of ordered
writes and flushes that allow the entry to be updated while keeping it
valid. If such an update is not possible, the entry is made temporarily
non-valid.

64-bit and 128-bit quanta versions are provided to support existing
IOMMUs.
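The worst case (a breaking update) boils down to three ordered steps. A
standalone toy sketch, with a no-op stand-in for the driver's flush/sync
callback and the valid bit assumed to live in qword 0:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_QW	2
#define V_BIT	1ULL		/* valid bit lives in qword 0 here */

/* Stand-in for the driver's sync callback (cache flush + invalidation). */
static void sync_hw(void) { }

/*
 * Never let the HW observe a half-written valid entry:
 *  1) clear V, 2) fill every qword except the one holding V,
 *  3) publish the final qword 0 (V=1) with a single store.
 */
static void disruptive_update(uint64_t *entry, const uint64_t *target)
{
	entry[0] &= ~V_BIT;			/* step 1: now non-valid */
	sync_hw();

	for (int i = 1; i != NUM_QW; i++)	/* step 2: fill the rest */
		entry[i] = target[i];
	sync_hw();

	entry[0] = target[0];			/* step 3: V=1 in one store */
	sync_hw();
}
```

The hitless paths in the library exist precisely to skip step 1 when the
used-bit analysis shows only one quanta (or only ignored bits) changes.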

Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/Kconfig               |  14 +++
 drivers/iommu/Makefile              |   1 +
 drivers/iommu/entry_sync.h          |  66 +++++++++++++
 drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
 drivers/iommu/entry_sync.c          |  68 +++++++++++++
 5 files changed, 292 insertions(+)
 create mode 100644 drivers/iommu/entry_sync.h
 create mode 100644 drivers/iommu/entry_sync_template.h
 create mode 100644 drivers/iommu/entry_sync.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index f86262b11416..2650c9fa125b 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -145,6 +145,20 @@ config IOMMU_DEFAULT_PASSTHROUGH
 
 endchoice
 
+config IOMMU_ENTRY_SYNC
+	bool
+	default n
+
+config IOMMU_ENTRY_SYNC64
+	bool
+	select IOMMU_ENTRY_SYNC
+	default n
+
+config IOMMU_ENTRY_SYNC128
+	bool
+	select IOMMU_ENTRY_SYNC
+	default n
+
 config OF_IOMMU
 	def_bool y
 	depends on OF && IOMMU_API
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 0275821f4ef9..bd923995497a 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
 obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
+obj-$(CONFIG_IOMMU_ENTRY_SYNC) += entry_sync.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
diff --git a/drivers/iommu/entry_sync.h b/drivers/iommu/entry_sync.h
new file mode 100644
index 000000000000..004d421c71c0
--- /dev/null
+++ b/drivers/iommu/entry_sync.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Many IOMMU implementations store data structures in host memory that can be
+ * quite big. The IOMMU DMA-reads the host memory in atomic quanta, usually 64
+ * or 128 bits, and reads a full entry using multiple quanta reads.
+ *
+ * Updating a host memory data structure entry while the HW is concurrently
+ * DMA-reading it is somewhat involved, but doing it hitlessly, without ever
+ * making the entry non-valid, is quite complicated.
+ *
+ * entry_sync is a library to handle this task. It works on the notion of "used
+ * bits", which reflect which bits the HW is actually sensitive to and which
+ * bits are ignored by hardware. Many hardware specifications say things like
+ * 'if mode is X then bits ABC are ignored'.
+ *
+ * Using the ignored bits, entry_sync can often compute a series of ordered
+ * writes and flushes that allow the entry to be updated while keeping it
+ * valid. If such an update is not possible, the entry is made temporarily
+ * non-valid.
+ *
+ * 64-bit and 128-bit quanta versions are provided to support existing
+ * IOMMUs.
+ */
+#ifndef IOMMU_ENTRY_SYNC_H
+#define IOMMU_ENTRY_SYNC_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/bug.h>
+
+/* Caller allocates a stack array of this length to call entry_sync_write() */
+#define ENTRY_SYNC_MEMORY_LEN(writer) ((writer)->num_quantas * 3)
+
+struct entry_sync_writer_ops64;
+struct entry_sync_writer64 {
+	const struct entry_sync_writer_ops64 *ops;
+	size_t num_quantas;
+	size_t vbit_quanta;
+};
+
+struct entry_sync_writer_ops64 {
+	void (*get_used)(const __le64 *entry, __le64 *used);
+	void (*sync)(struct entry_sync_writer64 *writer);
+};
+
+void entry_sync_write64(struct entry_sync_writer64 *writer, __le64 *entry,
+			const __le64 *target, __le64 *memory,
+			size_t memory_len);
+
+struct entry_sync_writer_ops128;
+struct entry_sync_writer128 {
+	const struct entry_sync_writer_ops128 *ops;
+	size_t num_quantas;
+	size_t vbit_quanta;
+};
+
+struct entry_sync_writer_ops128 {
+	void (*get_used)(const u128 *entry, u128 *used);
+	void (*sync)(struct entry_sync_writer128 *writer);
+};
+
+void entry_sync_write128(struct entry_sync_writer128 *writer, u128 *entry,
+			 const u128 *target, u128 *memory,
+			 size_t memory_len);
+
+#endif
diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
new file mode 100644
index 000000000000..646f518b098e
--- /dev/null
+++ b/drivers/iommu/entry_sync_template.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#include "entry_sync.h"
+#include <linux/args.h>
+#include <linux/bitops.h>
+
+#ifndef entry_sync_writer
+#define entry_sync_writer entry_sync_writer64
+#define quanta_t __le64
+#define NS(name) CONCATENATE(name, 64)
+#endif
+
+/*
+ * Figure out if we can do a hitless update of entry to become target. Returns a
+ * bit mask where 1 indicates that a quanta word needs to be set disruptively.
+ * unused_update is an intermediate value of entry that has unused bits set to
+ * their new values.
+ */
+static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
+				const quanta_t *entry, const quanta_t *target,
+				quanta_t *unused_update, quanta_t *memory)
+{
+	quanta_t *target_used = memory + writer->num_quantas * 1;
+	quanta_t *cur_used = memory + writer->num_quantas * 2;
+	u8 used_qword_diff = 0;
+	unsigned int i;
+
+	writer->ops->get_used(entry, cur_used);
+	writer->ops->get_used(target, target_used);
+
+	for (i = 0; i != writer->num_quantas; i++) {
+		/*
+		 * Check that masks are up to date; the make functions are not
+		 * allowed to set a bit to 1 if the used function doesn't say
+		 * it is used.
+		 */
+		WARN_ON_ONCE(target[i] & ~target_used[i]);
+
+		/* Bits can change because they are not currently being used */
+		unused_update[i] = (entry[i] & cur_used[i]) |
+				   (target[i] & ~cur_used[i]);
+		/*
+		 * Each bit indicates that a used bit in a qword needs to be
+		 * changed after unused_update is applied.
+		 */
+		if ((unused_update[i] & target_used[i]) != target[i])
+			used_qword_diff |= 1 << i;
+	}
+	return used_qword_diff;
+}
+
+/*
+ * Update the entry to the target configuration. The transition from the
+ * current entry to the target entry takes place over multiple steps that
+ * attempt to make the transition hitless if possible. This function takes
+ * care not to create a situation where the HW can perceive a corrupted
+ * entry. HW is only required to have quanta-sized atomicity with stores
+ * from the CPU, while entries are many quanta big.
+ *
+ * The difference between the current value and the target value is analyzed
+ * to determine which of three updates is required - disruptive, hitless or
+ * no change.
+ *
+ * In the most general disruptive case we can make any update in three steps:
+ *  - Disrupt the entry (V=0)
+ *  - Fill the now unused quanta, except the one which contains V
+ *  - Make the V quanta have the final value and valid (V=1) with a single
+ *    quanta-sized store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where an entry can be updated without disturbing the
+ * HW because only a small number of bits are changing (e.g. SMMUv3's S1DSS
+ * or CONFIG fields) or because the used bits don't intersect. We can detect
+ * this by calculating how many quanta need updating after adjusting the
+ * unused bits, and skip the V=0 process. This relies on the IGNORED
+ * behavior described in the hardware specification.
+ */
+void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
+			  const quanta_t *target, quanta_t *memory,
+			  size_t memory_len)
+{
+	quanta_t *unused_update = memory + writer->num_quantas * 0;
+	u8 used_qword_diff;
+
+	if (WARN_ON(memory_len !=
+		    ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
+		return;
+
+	used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
+						unused_update, memory);
+	if (hweight8(used_qword_diff) == 1) {
+		/*
+		 * Only one quanta needs its used bits to be changed. This is a
+		 * hitless update, update all bits the current entry is ignoring
+		 * to their new values, then update a single "critical quanta"
+		 * to change the entry and finally 0 out any bits that are now
+		 * unused in the target configuration.
+		 */
+		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
+
+		/*
+		 * Skip writing unused bits in the critical quanta since we'll
+		 * be writing it in the next step anyways. This can save a sync
+		 * when the only change is in that quanta.
+		 */
+		unused_update[critical_qword_index] =
+			entry[critical_qword_index];
+		NS(entry_set)(writer, entry, unused_update, 0,
+			      writer->num_quantas);
+		NS(entry_set)(writer, entry, target, critical_qword_index, 1);
+		NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
+	} else if (used_qword_diff) {
+		/*
+		 * At least two quantas need their in-use bits to be changed.
+		 * This requires a breaking update: zero the V bit, write all
+		 * quanta except the one holding V, then set the V quanta.
+		 */
+		unused_update[writer->vbit_quanta] = 0;
+		NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
+
+		if (writer->vbit_quanta != 0)
+			NS(entry_set)(writer, entry, target, 0,
+				      writer->vbit_quanta);
+		if (writer->vbit_quanta + 1 != writer->num_quantas)
+			NS(entry_set)(writer, entry, target,
+				      writer->vbit_quanta + 1,
+				      writer->num_quantas - writer->vbit_quanta - 1);
+
+		NS(entry_set)(writer, entry, target, writer->vbit_quanta, 1);
+	} else {
+		/*
+		 * No inuse bit changed. Sanity check that all unused bits are 0
+		 * in the entry. The target was already sanity checked by
+		 * entry_quanta_diff().
+		 */
+		WARN_ON_ONCE(NS(entry_set)(writer, entry, target, 0,
+					   writer->num_quantas));
+	}
+}
+EXPORT_SYMBOL(NS(entry_sync_write));
+
+#undef entry_sync_writer
+#undef quanta_t
+#undef NS
diff --git a/drivers/iommu/entry_sync.c b/drivers/iommu/entry_sync.c
new file mode 100644
index 000000000000..48d31270dbba
--- /dev/null
+++ b/drivers/iommu/entry_sync.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Helpers for drivers to update multi-quanta entries shared with HW
+ * without racing the hardware, minimizing disruptive updates.
+ */
+#include "entry_sync.h"
+#include <linux/kconfig.h>
+#include <linux/atomic.h>
+
+#if IS_ENABLED(CONFIG_IOMMU_ENTRY_SYNC64)
+static bool entry_set64(struct entry_sync_writer64 *writer, __le64 *entry,
+			const __le64 *target, unsigned int start,
+			unsigned int len)
+{
+	bool changed = false;
+	unsigned int i;
+
+	for (i = start; len != 0; len--, i++) {
+		if (entry[i] != target[i]) {
+			WRITE_ONCE(entry[i], target[i]);
+			changed = true;
+		}
+	}
+
+	if (changed)
+		writer->ops->sync(writer);
+	return changed;
+}
+
+#define entry_sync_writer entry_sync_writer64
+#define quanta_t __le64
+#define NS(name) CONCATENATE(name, 64)
+#include "entry_sync_template.h"
+#endif
+
+#if IS_ENABLED(CONFIG_IOMMU_ENTRY_SYNC128)
+static bool entry_set128(struct entry_sync_writer128 *writer, u128 *entry,
+			 const u128 *target, unsigned int start,
+			 unsigned int len)
+{
+	bool changed = false;
+	unsigned int i;
+
+	for (i = start; len != 0; len--, i++) {
+		if (entry[i] != target[i]) {
+			/*
+			 * Use cmpxchg128 to generate an indivisible write from
+			 * the CPU to DMA'able memory. This must ensure that HW
+			 * sees either the new or old 128 bit value and not
+			 * something torn. As updates are serialized by a
+			 * spinlock, we use the local (unlocked) variant to
+			 * avoid unnecessary bus locking overhead.
+			 */
+			cmpxchg128_local(&entry[i], entry[i], target[i]);
+			changed = true;
+		}
+	}
+
+	if (changed)
+		writer->ops->sync(writer);
+	return changed;
+}
+
+#define entry_sync_writer entry_sync_writer128
+#define quanta_t u128
+#define NS(name) CONCATENATE(name, 128)
+#include "entry_sync_template.h"
+#endif
-- 
2.43.0



* [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
  2026-03-09  6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09 13:41   ` Jason Gunthorpe
  2026-03-09  6:06 ` [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support Lu Baolu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

Updating PASID table entries while the device hardware is possibly
performing DMA concurrently is complex. Traditionally, this required
a "clear-then-update" approach: clear the Present bit, flush the
caches, update the entry, and then restore the Present bit. This
causes unnecessary latency and interruptions for transactions that
might not even be affected by the specific bits being changed.

Plumb the generic entry_sync library into this driver to modernize
this process. The library uses the concept of "used bits" to determine
whether a transition can be performed hitlessly (via a single atomic
128-bit swap) or whether a disruptive 3-step update is truly required.

The implementation includes:

- intel_pasid_get_used(): Defines which bits the IOMMU hardware is
  sensitive to based on the PGTT.
- intel_pasid_sync(): Handles the required clflushes, PASID cache
  invalidations, and IOTLB/Dev-TLB flushes required between update
  steps.
- 128-bit atomicity: Depends on IOMMU_ENTRY_SYNC128 to ensure that
  512-bit PASID entries are updated in atomic 128-bit quanta,
  preventing the hardware from ever seeing a "torn" entry.
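The decision the library makes on the driver's behalf can be summarized
as follows (a sketch mirroring the hweight-of-diff logic in the library,
not actual driver code):

```c
#include <assert.h>

enum update_kind { NO_CHANGE, HITLESS, DISRUPTIVE };

/*
 * used_quanta_diff has one bit per 128-bit quanta whose used bits must
 * change: no bits set means only ignored bits change, one bit means a
 * single atomic 128-bit swap suffices, and more bits mean the entry
 * must transit through P=0.
 */
static enum update_kind classify(unsigned int used_quanta_diff)
{
	int bits = __builtin_popcount(used_quanta_diff);

	if (bits == 0)
		return NO_CHANGE;
	return bits == 1 ? HITLESS : DISRUPTIVE;
}
```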

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/Kconfig |   2 +
 drivers/iommu/intel/pasid.c | 173 ++++++++++++++++++++++++++++++++++++
 2 files changed, 175 insertions(+)

diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 5471f814e073..7fa31b9d4ef4 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -26,6 +26,8 @@ config INTEL_IOMMU
 	select PCI_ATS
 	select PCI_PRI
 	select PCI_PASID
+	select IOMMU_ENTRY_SYNC
+	select IOMMU_ENTRY_SYNC128
 	help
 	  DMA remapping (DMAR) devices support enables independent address
 	  translations for Direct Memory Access (DMA) from devices.
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 9d30015b8940..5b9eb5c8f42d 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -21,12 +21,185 @@
 #include "iommu.h"
 #include "pasid.h"
 #include "../iommu-pages.h"
+#include "../entry_sync.h"
 
 /*
  * Intel IOMMU system wide PASID name space:
  */
 u32 intel_pasid_max_id = PASID_MAX;
 
+/*
+ * Plumb into the generic entry_sync library:
+ */
+static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid);
+static void pasid_flush_caches(struct intel_iommu *iommu, struct pasid_entry *pte,
+			       u32 pasid, u16 did);
+static void intel_pasid_flush_present(struct intel_iommu *iommu, struct device *dev,
+				      u32 pasid, u16 did, struct pasid_entry *pte);
+static void pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
+						u16 did, u32 pasid);
+static void devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
+					   struct device *dev, u32 pasid);
+
+struct intel_pasid_writer {
+	struct entry_sync_writer128 writer;
+	struct intel_iommu *iommu;
+	struct device *dev;
+	u32 pasid;
+	struct pasid_entry orig_pte;
+	bool was_present;
+};
+
+/*
+ * Identify which bits of the 512-bit entry the HW is using. The "Used" bits
+ * are those that, if changed, would cause the IOMMU to behave differently
+ * for an active transaction.
+ */
+static void intel_pasid_get_used(const u128 *entry, u128 *used)
+{
+	struct pasid_entry *pe = (struct pasid_entry *)entry;
+	struct pasid_entry *ue = (struct pasid_entry *)used;
+	u16 pgtt;
+
+	/* Initialize used bits to 0. */
+	memset(ue, 0, sizeof(*ue));
+
+	/* Present bit always matters. */
+	ue->val[0] |= PASID_PTE_PRESENT;
+
+	/* Nothing more for non-present entries. */
+	if (!(pe->val[0] & PASID_PTE_PRESENT))
+		return;
+
+	pgtt = pasid_pte_get_pgtt(pe);
+	switch (pgtt) {
+	case PASID_ENTRY_PGTT_FL_ONLY:
+		/* AW, PGTT */
+		ue->val[0] |= GENMASK_ULL(4, 2) | GENMASK_ULL(8, 6);
+		/* DID, PWSNP, PGSNP */
+		ue->val[1] |= GENMASK_ULL(24, 23) | GENMASK_ULL(15, 0);
+		/* FSPTPTR, FSPM */
+		ue->val[2] |= GENMASK_ULL(63, 12) | GENMASK_ULL(3, 2);
+		break;
+	case PASID_ENTRY_PGTT_NESTED:
+		/* FPD, AW, PGTT, SSADE, SSPTPTR */
+		ue->val[0] |= GENMASK_ULL(63, 12) | GENMASK_ULL(9, 6) |
+				GENMASK_ULL(4, 1);
+		/* PGSNP, DID, PWSNP */
+		ue->val[1] |= GENMASK_ULL(24, 23) | GENMASK_ULL(15, 0);
+		/* FSPTPTR, FSPM, EAFE, WPE, SRE */
+		ue->val[2] |= GENMASK_ULL(63, 12) | BIT_ULL(7) |
+				GENMASK_ULL(4, 2) | BIT_ULL(0);
+		break;
+	case PASID_ENTRY_PGTT_SL_ONLY:
+		/* FPD, AW, PGTT, SSADE, SSPTPTR */
+		ue->val[0] |= GENMASK_ULL(63, 12) | GENMASK_ULL(9, 6) |
+				GENMASK_ULL(4, 1);
+		/* DID, PWSNP */
+		ue->val[1] |= GENMASK_ULL(15, 0) | BIT_ULL(23);
+		break;
+	case PASID_ENTRY_PGTT_PT:
+		/* FPD, AW, PGTT */
+		ue->val[0] |= GENMASK_ULL(4, 2) | GENMASK_ULL(8, 6) | BIT_ULL(1);
+		/* DID, PWSNP */
+		ue->val[1] |= GENMASK_ULL(15, 0) | BIT_ULL(23);
+		break;
+	default:
+		WARN_ON(true);
+	}
+}
+
+static void intel_pasid_sync(struct entry_sync_writer128 *writer)
+{
+	struct intel_pasid_writer *p_writer = container_of(writer,
+			struct intel_pasid_writer, writer);
+	struct intel_iommu *iommu = p_writer->iommu;
+	struct device *dev = p_writer->dev;
+	bool was_present, is_present;
+	u32 pasid = p_writer->pasid;
+	struct pasid_entry *pte;
+	u16 old_did, old_pgtt;
+
+	pte = intel_pasid_get_entry(dev, pasid);
+	was_present = p_writer->was_present;
+	is_present = pasid_pte_is_present(pte);
+	old_did = pasid_get_domain_id(&p_writer->orig_pte);
+	old_pgtt = pasid_pte_get_pgtt(&p_writer->orig_pte);
+
+	/* Update the last present state: */
+	p_writer->was_present = is_present;
+
+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pte, sizeof(*pte));
+
+	/* Sync for "P=0" to "P=1": */
+	if (!was_present) {
+		if (is_present)
+			pasid_flush_caches(iommu, pte, pasid,
+					   pasid_get_domain_id(pte));
+
+		return;
+	}
+
+	/* Sync for "P=1" to "P=1": */
+	if (is_present) {
+		intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
+		return;
+	}
+
+	/* Sync for "P=1" to "P=0": */
+	pasid_cache_invalidation_with_pasid(iommu, old_did, pasid);
+
+	if (old_pgtt == PASID_ENTRY_PGTT_PT || old_pgtt == PASID_ENTRY_PGTT_FL_ONLY)
+		qi_flush_piotlb(iommu, old_did, pasid, 0, -1, 0);
+	else
+		iommu->flush.flush_iotlb(iommu, old_did, 0, 0, DMA_TLB_DSI_FLUSH);
+
+	devtlb_invalidation_with_pasid(iommu, dev, pasid);
+}
+
+static const struct entry_sync_writer_ops128 writer_ops128 = {
+	.get_used = intel_pasid_get_used,
+	.sync = intel_pasid_sync,
+};
+
+#define INTEL_PASID_SYNC_MEM_COUNT	12
+
+static int __maybe_unused intel_pasid_write(struct intel_iommu *iommu,
+					    struct device *dev, u32 pasid,
+					    u128 *target)
+{
+	struct pasid_entry *pte = intel_pasid_get_entry(dev, pasid);
+	struct intel_pasid_writer p_writer = {
+		.writer = {
+			.ops = &writer_ops128,
+			/* 512 bits total (4 * 128-bit chunks) */
+			.num_quantas = 4,
+			/* The 'P' bit is in the first 128-bit chunk */
+			.vbit_quanta = 0,
+		},
+		.iommu = iommu,
+		.dev = dev,
+		.pasid = pasid,
+	};
+	u128 memory[INTEL_PASID_SYNC_MEM_COUNT];
+
+	if (!pte)
+		return -ENODEV;
+
+	p_writer.orig_pte = *pte;
+	p_writer.was_present = pasid_pte_is_present(pte);
+
+	/*
+	 * The library now does the heavy lifting:
+	 * 1. Checks if it can do a 1-quanta hitless flip.
+	 * 2. If not, it does a 3-step V=0 (disruptive) update.
+	 */
+	entry_sync_write128(&p_writer.writer, (u128 *)pte, target, memory, sizeof(memory));
+
+	return 0;
+}
+
 /*
  * Per device pasid table management:
  */
-- 
2.43.0



* [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
  2026-03-09  6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
  2026-03-09  6:06 ` [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09 13:42   ` Jason Gunthorpe
  2026-03-09  6:06 ` [PATCH 4/8] iommu/vt-d: Add trace events for PASID entry sync updates Lu Baolu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

The Intel IOMMU driver is moving toward using the generic entry_sync
library for PASID table entry updates. This library requires 128-bit
atomic write operations (cmpxchg128) to update 512-bit PASID entries in
atomic quanta, ensuring the hardware never observes a torn entry.

On x86_64, 128-bit atomicity is provided by the CMPXCHG16B instruction.
Update the driver to:

1. Limit INTEL_IOMMU to X86_64, as 128-bit atomic operations are not
   available on 32-bit x86.
2. Gate pasid_supported() on the presence of X86_FEATURE_CX16.
3. Log a boot-time notice when a PASID-capable IOMMU is detected on a
   CPU lacking the required instruction.
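To see why plain 64-bit stores are not enough, consider a 128-bit quanta
updated as two 64-bit halves. A concurrent reader (the IOMMU's DMA read,
modeled here as a snapshot taken between the two stores) can observe a
value matching neither the old nor the new entry. This is an
illustrative userspace sketch only:

```c
#include <assert.h>
#include <stdint.h>

struct quanta128 { uint64_t lo, hi; };

/*
 * Update the two halves non-atomically, taking a "HW read" snapshot
 * between the stores. A single 128-bit store (CMPXCHG16B on x86_64)
 * makes such a torn observation impossible.
 */
static struct quanta128 update_with_torn_read(struct quanta128 *e,
					      struct quanta128 tgt)
{
	struct quanta128 snap;

	e->lo = tgt.lo;		/* first 64-bit store lands */
	snap = *e;		/* ...the HW reads here... */
	e->hi = tgt.hi;		/* second store completes the update */

	return snap;
}
```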

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/Kconfig | 2 +-
 drivers/iommu/intel/iommu.h | 3 ++-
 drivers/iommu/intel/iommu.c | 4 ++++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 7fa31b9d4ef4..fee7fea9dfcb 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -11,7 +11,7 @@ config DMAR_DEBUG
 
 config INTEL_IOMMU
 	bool "Support for Intel IOMMU using DMA Remapping Devices"
-	depends on PCI_MSI && ACPI && X86
+	depends on PCI_MSI && ACPI && X86_64
 	select IOMMU_API
 	select GENERIC_PT
 	select IOMMU_PT
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 599913fb65d5..54b58d01d0cb 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -535,7 +535,8 @@ enum {
 
 #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
-				 ecap_pasid((iommu)->ecap))
+				 ecap_pasid((iommu)->ecap) &&		\
+				 boot_cpu_has(X86_FEATURE_CX16))
 #define ssads_supported(iommu) (sm_supported(iommu) &&                 \
 				ecap_slads((iommu)->ecap) &&           \
 				ecap_smpwc(iommu->ecap))
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ef7613b177b9..5369526e89d0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2647,6 +2647,10 @@ int __init intel_iommu_init(void)
 			pr_info_once("IOMMU batching disallowed due to virtualization\n");
 			iommu_set_dma_strict();
 		}
+
+		if (ecap_pasid(iommu->ecap) && !boot_cpu_has(X86_FEATURE_CX16))
+			pr_info_once("PASID disabled due to lack of CMPXCHG16B support.\n");
+
 		iommu_device_sysfs_add(&iommu->iommu, NULL,
 				       intel_iommu_groups,
 				       "%s", iommu->name);
-- 
2.43.0



* [PATCH 4/8] iommu/vt-d: Add trace events for PASID entry sync updates
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
                   ` (2 preceding siblings ...)
  2026-03-09  6:06 ` [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09  6:06 ` [PATCH 5/8] iommu/vt-d: Use intel_pasid_write() for first-stage setup Lu Baolu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

The entry_sync library introduces a more complex, multi-step update
process for PASID table entries to enable hitless transitions. Add a
set of trace events specifically for the Intel PASID sync plumbing.

The new trace events:

- entry_write_start / entry_write_complete: Capture the state of the
  512-bit PASID entry before and after the entry_sync library performs
  its update logic. This allows verification of the final output against
  the target.

- entry_get_used: Logs the current entry alongside the calculated "used
  bits" mask. This is critical for debugging the library's decision on
  whether an update can be hitless or must be disruptive.

- entry_sync: Tracks the state transitions (was_present vs. is_present)
  within the entry_sync callback. This helps verify that the correct cache
  invalidations and IOTLB flushes are triggered for specific transitions
  (e.g., P=1 to P=1 hitless vs. P=1 to P=0 disruptive).

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/trace.h | 107 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c |  11 +++-
 2 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/trace.h b/drivers/iommu/intel/trace.h
index 6311ba3f1691..b0ccda6f8dc5 100644
--- a/drivers/iommu/intel/trace.h
+++ b/drivers/iommu/intel/trace.h
@@ -181,6 +181,113 @@ DEFINE_EVENT(cache_tag_flush, cache_tag_flush_range_np,
 		 unsigned long addr, unsigned long pages, unsigned long mask),
 	TP_ARGS(tag, start, end, addr, pages, mask)
 );
+
+DECLARE_EVENT_CLASS(entry_write,
+	TP_PROTO(struct device *dev, u32 pasid, u128 *target, u128 *curr),
+	TP_ARGS(dev, pasid, target, curr),
+
+	TP_STRUCT__entry(
+		__string(dev, dev_name(dev))
+		__field(u32, pasid)
+		__field(u64, t_w3)
+		__field(u64, t_w2)
+		__field(u64, t_w1)
+		__field(u64, t_w0)
+		__field(u64, c_w3)
+		__field(u64, c_w2)
+		__field(u64, c_w1)
+		__field(u64, c_w0)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev);
+		__entry->pasid = pasid;
+		/* Target Entry */
+		__entry->t_w0 = (u64)target[0];
+		__entry->t_w1 = (u64)(target[0] >> 64);
+		__entry->t_w2 = (u64)target[1];
+		__entry->t_w3 = (u64)(target[1] >> 64);
+		/* Current Entry */
+		__entry->c_w0 = (u64)curr[0];
+		__entry->c_w1 = (u64)(curr[0] >> 64);
+		__entry->c_w2 = (u64)curr[1];
+		__entry->c_w3 = (u64)(curr[1] >> 64);
+	),
+
+	TP_printk("%s[%u] target %016llx:%016llx:%016llx:%016llx, current %016llx:%016llx:%016llx:%016llx",
+		  __get_str(dev), __entry->pasid,
+		  __entry->t_w3, __entry->t_w2, __entry->t_w1, __entry->t_w0,
+		  __entry->c_w3, __entry->c_w2, __entry->c_w1, __entry->c_w0
+	)
+);
+
+DEFINE_EVENT(entry_write, entry_write_start,
+	TP_PROTO(struct device *dev, u32 pasid, u128 *target, u128 *curr),
+	TP_ARGS(dev, pasid, target, curr)
+);
+
+DEFINE_EVENT(entry_write, entry_write_complete,
+	TP_PROTO(struct device *dev, u32 pasid, u128 *target, u128 *curr),
+	TP_ARGS(dev, pasid, target, curr)
+);
+
+TRACE_EVENT(entry_get_used,
+	TP_PROTO(const u128 *pe, u128 *used),
+	TP_ARGS(pe, used),
+
+	TP_STRUCT__entry(
+		__field(u64, e_w3)
+		__field(u64, e_w2)
+		__field(u64, e_w1)
+		__field(u64, e_w0)
+		__field(u64, u_w3)
+		__field(u64, u_w2)
+		__field(u64, u_w1)
+		__field(u64, u_w0)
+	),
+
+	TP_fast_assign(
+		__entry->e_w0 = (u64)pe[0];
+		__entry->e_w1 = (u64)(pe[0] >> 64);
+		__entry->e_w2 = (u64)pe[1];
+		__entry->e_w3 = (u64)(pe[1] >> 64);
+
+		__entry->u_w0 = (u64)used[0];
+		__entry->u_w1 = (u64)(used[0] >> 64);
+		__entry->u_w2 = (u64)used[1];
+		__entry->u_w3 = (u64)(used[1] >> 64);
+	),
+
+	TP_printk("entry %016llx:%016llx:%016llx:%016llx, used %016llx:%016llx:%016llx:%016llx",
+		  __entry->e_w3, __entry->e_w2, __entry->e_w1, __entry->e_w0,
+		  __entry->u_w3, __entry->u_w2, __entry->u_w1, __entry->u_w0
+	)
+);
+
+TRACE_EVENT(entry_sync,
+	TP_PROTO(struct device *dev, u32 pasid, bool was_present, bool is_present),
+	TP_ARGS(dev, pasid, was_present, is_present),
+
+	TP_STRUCT__entry(
+		__string(dev, dev_name(dev))
+		__field(u32, pasid)
+		__field(bool, was_present)
+		__field(bool, is_present)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev);
+		__entry->pasid = pasid;
+		__entry->was_present = was_present;
+		__entry->is_present = is_present;
+	),
+
+	TP_printk("%s[%u] was %s, is now %s",
+		  __get_str(dev), __entry->pasid,
+		  __entry->was_present ? "present" : "non-present",
+		  __entry->is_present ? "present" : "non-present"
+	)
+);
 #endif /* _TRACE_INTEL_IOMMU_H */
 
 /* This part must be outside protection */
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 5b9eb5c8f42d..b7c8888afaef 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -20,6 +20,7 @@
 
 #include "iommu.h"
 #include "pasid.h"
+#include "trace.h"
 #include "../iommu-pages.h"
 #include "../entry_sync.h"
 
@@ -68,8 +69,10 @@ static void intel_pasid_get_used(const u128 *entry, u128 *used)
 	ue->val[0] |= PASID_PTE_PRESENT;
 
 	/* Nothing more for non-present entries. */
-	if (!(pe->val[0] & PASID_PTE_PRESENT))
+	if (!(pe->val[0] & PASID_PTE_PRESENT)) {
+		trace_entry_get_used(entry, used);
 		return;
+	}
 
 	pgtt = pasid_pte_get_pgtt(pe);
 	switch (pgtt) {
@@ -107,6 +110,8 @@ static void intel_pasid_get_used(const u128 *entry, u128 *used)
 	default:
 		WARN_ON(true);
 	}
+
+	trace_entry_get_used(entry, used);
 }
 
 static void intel_pasid_sync(struct entry_sync_writer128 *writer)
@@ -132,6 +137,8 @@ static void intel_pasid_sync(struct entry_sync_writer128 *writer)
 	if (!ecap_coherent(iommu->ecap))
 		clflush_cache_range(pte, sizeof(*pte));
 
+	trace_entry_sync(dev, pasid, was_present, is_present);
+
 	/* Sync for "P=0" to "P=1": */
 	if (!was_present) {
 		if (is_present)
@@ -195,7 +202,9 @@ static int __maybe_unused intel_pasid_write(struct intel_iommu *iommu,
 	 * 1. Checks if it can do a 1-quanta hitless flip.
 	 * 2. If not, it does a 3-step V=0 (disruptive) update.
 	 */
+	trace_entry_write_start(dev, pasid, target, (u128 *)pte);
 	entry_sync_write128(&p_writer.writer, (u128 *)pte, target, memory, sizeof(memory));
+	trace_entry_write_complete(dev, pasid, target, (u128 *)pte);
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/8] iommu/vt-d: Use intel_pasid_write() for first-stage setup
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
                   ` (3 preceding siblings ...)
  2026-03-09  6:06 ` [PATCH 4/8] iommu/vt-d: Add trace events for PASID entry sync updates Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09  6:06 ` [PATCH 6/8] iommu/vt-d: Use intel_pasid_write() for second-stage setup Lu Baolu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

Refactor intel_pasid_setup_first_level() to use the intel_pasid_write()
helper. With the move to the entry_sync library, the driver constructs the
target entry in a local buffer and hands it off to intel_pasid_write().

This refactoring removes the need for __domain_setup_first_level(),
simplifies locking by using the group mutex, and ensures a consistent
update path for all first-stage setups.
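
As a side note, the new "build locally, then commit" flow can be pictured
with a small userspace sketch. All types and helpers below are simplified
stand-ins for the driver's real ones (a 512-bit entry reduced to eight u64
words, field offsets invented for illustration), not the kernel API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified model of a 512-bit scalable-mode PASID entry. */
struct pasid_entry {
	uint64_t val[8];
};

#define PASID_PTE_PRESENT	1ULL

/* Build the target entry in a local buffer (mirrors pasid_pte_config_*). */
static void config_first_level(struct pasid_entry *pte, uint64_t fsptptr,
			       uint16_t did)
{
	memset(pte, 0, sizeof(*pte));
	pte->val[2] = fsptptr & ~0xfffULL;	/* page table pointer */
	pte->val[1] = did;			/* domain ID */
	pte->val[0] |= PASID_PTE_PRESENT;	/* set P last */
}

/* Stand-in for intel_pasid_write(): commit the local buffer into the live
 * table slot. The real helper routes this through entry_sync_write128(). */
static void pasid_write(struct pasid_entry *live,
			const struct pasid_entry *target)
{
	memcpy(live, target, sizeof(*live));
}
```

The point of the split is that all field setup happens on stack memory the
hardware never sees; only the final commit touches the live entry, and in
the real driver that commit is the entry_sync atomic-quanta sequence.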

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.h |  5 -----
 drivers/iommu/intel/iommu.c | 16 +++-------------
 drivers/iommu/intel/pasid.c | 36 +++++++++---------------------------
 drivers/iommu/intel/svm.c   |  5 ++---
 4 files changed, 14 insertions(+), 48 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 54b58d01d0cb..fd6ca3b7f594 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1202,11 +1202,6 @@ domain_add_dev_pasid(struct iommu_domain *domain,
 		     struct device *dev, ioasid_t pasid);
 void domain_remove_dev_pasid(struct iommu_domain *domain,
 			     struct device *dev, ioasid_t pasid);
-
-int __domain_setup_first_level(struct intel_iommu *iommu, struct device *dev,
-			       ioasid_t pasid, u16 did, phys_addr_t fsptptr,
-			       int flags, struct iommu_domain *old);
-
 int dmar_ir_support(void);
 
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5369526e89d0..db5e8dad50dc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1248,16 +1248,6 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 	__iommu_flush_cache(iommu, context, sizeof(*context));
 }
 
-int __domain_setup_first_level(struct intel_iommu *iommu, struct device *dev,
-			       ioasid_t pasid, u16 did, phys_addr_t fsptptr,
-			       int flags, struct iommu_domain *old)
-{
-	if (old)
-		intel_pasid_tear_down_entry(iommu, dev, pasid, false);
-
-	return intel_pasid_setup_first_level(iommu, dev, fsptptr, pasid, did, flags);
-}
-
 static int domain_setup_second_level(struct intel_iommu *iommu,
 				     struct dmar_domain *domain,
 				     struct device *dev, ioasid_t pasid,
@@ -1301,9 +1291,9 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 	      BIT(PT_FEAT_DMA_INCOHERENT)))
 		flags |= PASID_FLAG_PWSNP;
 
-	return __domain_setup_first_level(iommu, dev, pasid,
-					  domain_id_iommu(domain, iommu),
-					  pt_info.gcr3_pt, flags, old);
+	return intel_pasid_setup_first_level(iommu, dev, pt_info.gcr3_pt, pasid,
+					     domain_id_iommu(domain, iommu),
+					     flags);
 }
 
 static int dmar_domain_attach_device(struct dmar_domain *domain,
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index b7c8888afaef..8ea1ac8cbf5e 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -172,9 +172,8 @@ static const struct entry_sync_writer_ops128 writer_ops128 = {
 
 #define INTEL_PASID_SYNC_MEM_COUNT	12
 
-static int __maybe_unused intel_pasid_write(struct intel_iommu *iommu,
-					    struct device *dev, u32 pasid,
-					    u128 *target)
+static int intel_pasid_write(struct intel_iommu *iommu, struct device *dev,
+			     u32 pasid, u128 *target)
 {
 	struct pasid_entry *pte = intel_pasid_get_entry(dev, pasid);
 	struct intel_pasid_writer p_writer = {
@@ -531,17 +530,14 @@ static void intel_pasid_flush_present(struct intel_iommu *iommu,
 
 /*
  * Set up the scalable mode pasid table entry for first only
- * translation type.
+ * translation type. Caller should zero out the entry before
+ * calling.
  */
 static void pasid_pte_config_first_level(struct intel_iommu *iommu,
 					 struct pasid_entry *pte,
 					 phys_addr_t fsptptr, u16 did,
 					 int flags)
 {
-	lockdep_assert_held(&iommu->lock);
-
-	pasid_clear_entry(pte);
-
 	/* Setup the first level page table pointer: */
 	pasid_set_flptr(pte, fsptptr);
 
@@ -564,7 +560,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, struct device *dev,
 				  phys_addr_t fsptptr, u32 pasid, u16 did,
 				  int flags)
 {
-	struct pasid_entry *pte;
+	struct pasid_entry new_pte = {0};
+
+	iommu_group_mutex_assert(dev);
 
 	if (!ecap_flts(iommu->ecap)) {
 		pr_err("No first level translation support on %s\n",
@@ -578,25 +576,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, struct device *dev,
 		return -EINVAL;
 	}
 
-	spin_lock(&iommu->lock);
-	pte = intel_pasid_get_entry(dev, pasid);
-	if (!pte) {
-		spin_unlock(&iommu->lock);
-		return -ENODEV;
-	}
+	pasid_pte_config_first_level(iommu, &new_pte, fsptptr, did, flags);
 
-	if (pasid_pte_is_present(pte)) {
-		spin_unlock(&iommu->lock);
-		return -EBUSY;
-	}
-
-	pasid_pte_config_first_level(iommu, pte, fsptptr, did, flags);
-
-	spin_unlock(&iommu->lock);
-
-	pasid_flush_caches(iommu, pte, pasid, did);
-
-	return 0;
+	return intel_pasid_write(iommu, dev, pasid, (u128 *)&new_pte);
 }
 
 /*
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index fea10acd4f02..978d63073e3b 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -171,9 +171,8 @@ static int intel_svm_set_dev_pasid(struct iommu_domain *domain,
 	/* Setup the pasid table: */
 	sflags = cpu_feature_enabled(X86_FEATURE_LA57) ? PASID_FLAG_FL5LP : 0;
 	sflags |= PASID_FLAG_PWSNP;
-	ret = __domain_setup_first_level(iommu, dev, pasid,
-					 FLPT_DEFAULT_DID, __pa(mm->pgd),
-					 sflags, old);
+	ret = intel_pasid_setup_first_level(iommu, dev, __pa(mm->pgd),
+					    pasid, FLPT_DEFAULT_DID, sflags);
 	if (ret)
 		goto out_unwind_iopf;
 
-- 
2.43.0



* [PATCH 6/8] iommu/vt-d: Use intel_pasid_write() for second-stage setup
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
                   ` (4 preceding siblings ...)
  2026-03-09  6:06 ` [PATCH 5/8] iommu/vt-d: Use intel_pasid_write() for first-stage setup Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09  6:06 ` [PATCH 7/8] iommu/vt-d: Use intel_pasid_write() for pass-through setup Lu Baolu
  2026-03-09  6:06 ` [PATCH 8/8] iommu/vt-d: Use intel_pasid_write() for nested setup Lu Baolu
  7 siblings, 0 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

Refactor intel_pasid_setup_second_level() to use the
intel_pasid_write() helper. As with the first-stage setup, this moves the
second-stage setup logic onto the entry_sync library: the target PASID
entry is constructed in a local buffer and committed via
intel_pasid_write().

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 19 ++++---------------
 drivers/iommu/intel/pasid.c | 26 ++++----------------------
 2 files changed, 8 insertions(+), 37 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index db5e8dad50dc..b98020ac9de2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1248,17 +1248,6 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 	__iommu_flush_cache(iommu, context, sizeof(*context));
 }
 
-static int domain_setup_second_level(struct intel_iommu *iommu,
-				     struct dmar_domain *domain,
-				     struct device *dev, ioasid_t pasid,
-				     struct iommu_domain *old)
-{
-	if (old)
-		intel_pasid_tear_down_entry(iommu, dev, pasid, false);
-
-	return intel_pasid_setup_second_level(iommu, domain, dev, pasid);
-}
-
 static int domain_setup_passthrough(struct intel_iommu *iommu,
 				    struct device *dev, ioasid_t pasid,
 				    struct iommu_domain *old)
@@ -1323,8 +1312,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 		ret = domain_setup_first_level(iommu, domain, dev,
 					       IOMMU_NO_PASID, NULL);
 	else if (intel_domain_is_ss_paging(domain))
-		ret = domain_setup_second_level(iommu, domain, dev,
-						IOMMU_NO_PASID, NULL);
+		ret = intel_pasid_setup_second_level(iommu, domain,
+						     dev, IOMMU_NO_PASID);
 	else if (WARN_ON(true))
 		ret = -EINVAL;
 
@@ -3634,8 +3623,8 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 		ret = domain_setup_first_level(iommu, dmar_domain,
 					       dev, pasid, old);
 	else if (intel_domain_is_ss_paging(dmar_domain))
-		ret = domain_setup_second_level(iommu, dmar_domain,
-						dev, pasid, old);
+		ret = intel_pasid_setup_second_level(iommu, dmar_domain,
+						     dev, pasid);
 	else if (WARN_ON(true))
 		ret = -EINVAL;
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 8ea1ac8cbf5e..3084afb3d4a1 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -590,10 +590,7 @@ static void pasid_pte_config_second_level(struct intel_iommu *iommu,
 {
 	struct pt_iommu_vtdss_hw_info pt_info;
 
-	lockdep_assert_held(&iommu->lock);
-
 	pt_iommu_vtdss_hw_info(&domain->sspt, &pt_info);
-	pasid_clear_entry(pte);
 	pasid_set_domain_id(pte, did);
 	pasid_set_slptr(pte, pt_info.ssptptr);
 	pasid_set_address_width(pte, pt_info.aw);
@@ -611,9 +608,10 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid)
 {
-	struct pasid_entry *pte;
+	struct pasid_entry new_pte = {0};
 	u16 did;
 
+	iommu_group_mutex_assert(dev);
 
 	/*
 	 * If hardware advertises no support for second level
@@ -626,25 +624,9 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	}
 
 	did = domain_id_iommu(domain, iommu);
+	pasid_pte_config_second_level(iommu, &new_pte, domain, did);
 
-	spin_lock(&iommu->lock);
-	pte = intel_pasid_get_entry(dev, pasid);
-	if (!pte) {
-		spin_unlock(&iommu->lock);
-		return -ENODEV;
-	}
-
-	if (pasid_pte_is_present(pte)) {
-		spin_unlock(&iommu->lock);
-		return -EBUSY;
-	}
-
-	pasid_pte_config_second_level(iommu, pte, domain, did);
-	spin_unlock(&iommu->lock);
-
-	pasid_flush_caches(iommu, pte, pasid, did);
-
-	return 0;
+	return intel_pasid_write(iommu, dev, pasid, (u128 *)&new_pte);
 }
 
 /*
-- 
2.43.0



* [PATCH 7/8] iommu/vt-d: Use intel_pasid_write() for pass-through setup
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
                   ` (5 preceding siblings ...)
  2026-03-09  6:06 ` [PATCH 6/8] iommu/vt-d: Use intel_pasid_write() for second-stage setup Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  2026-03-09  6:06 ` [PATCH 8/8] iommu/vt-d: Use intel_pasid_write() for nested setup Lu Baolu
  7 siblings, 0 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

Refactor intel_pasid_setup_pass_through() to use the intel_pasid_write()
helper. This moves the pass-through setup implementation onto the
entry_sync library: the target PASID entry is constructed locally and
committed via the centralized intel_pasid_write() wrapper.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 12 +-----------
 drivers/iommu/intel/pasid.c | 26 ++++----------------------
 2 files changed, 5 insertions(+), 33 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b98020ac9de2..f1f9fafd3984 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1248,16 +1248,6 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 	__iommu_flush_cache(iommu, context, sizeof(*context));
 }
 
-static int domain_setup_passthrough(struct intel_iommu *iommu,
-				    struct device *dev, ioasid_t pasid,
-				    struct iommu_domain *old)
-{
-	if (old)
-		intel_pasid_tear_down_entry(iommu, dev, pasid, false);
-
-	return intel_pasid_setup_pass_through(iommu, dev, pasid);
-}
-
 static int domain_setup_first_level(struct intel_iommu *iommu,
 				    struct dmar_domain *domain,
 				    struct device *dev,
@@ -3848,7 +3838,7 @@ static int identity_domain_set_dev_pasid(struct iommu_domain *domain,
 	if (ret)
 		return ret;
 
-	ret = domain_setup_passthrough(iommu, dev, pasid, old);
+	ret = intel_pasid_setup_pass_through(iommu, dev, pasid);
 	if (ret) {
 		iopf_for_domain_replace(old, domain, dev);
 		return ret;
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3084afb3d4a1..cb55ff422d7d 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -704,9 +704,6 @@ int intel_pasid_setup_dirty_tracking(struct intel_iommu *iommu,
 static void pasid_pte_config_pass_through(struct intel_iommu *iommu,
 					  struct pasid_entry *pte, u16 did)
 {
-	lockdep_assert_held(&iommu->lock);
-
-	pasid_clear_entry(pte);
 	pasid_set_domain_id(pte, did);
 	pasid_set_address_width(pte, iommu->agaw);
 	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
@@ -718,27 +715,12 @@ static void pasid_pte_config_pass_through(struct intel_iommu *iommu,
 int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 				   struct device *dev, u32 pasid)
 {
-	u16 did = FLPT_DEFAULT_DID;
-	struct pasid_entry *pte;
+	struct pasid_entry new_pte = {0};
 
-	spin_lock(&iommu->lock);
-	pte = intel_pasid_get_entry(dev, pasid);
-	if (!pte) {
-		spin_unlock(&iommu->lock);
-		return -ENODEV;
-	}
+	iommu_group_mutex_assert(dev);
+	pasid_pte_config_pass_through(iommu, &new_pte, FLPT_DEFAULT_DID);
 
-	if (pasid_pte_is_present(pte)) {
-		spin_unlock(&iommu->lock);
-		return -EBUSY;
-	}
-
-	pasid_pte_config_pass_through(iommu, pte, did);
-	spin_unlock(&iommu->lock);
-
-	pasid_flush_caches(iommu, pte, pasid, did);
-
-	return 0;
+	return intel_pasid_write(iommu, dev, pasid, (u128 *)&new_pte);
 }
 
 /*
-- 
2.43.0



* [PATCH 8/8] iommu/vt-d: Use intel_pasid_write() for nested setup
  2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
                   ` (6 preceding siblings ...)
  2026-03-09  6:06 ` [PATCH 7/8] iommu/vt-d: Use intel_pasid_write() for pass-through setup Lu Baolu
@ 2026-03-09  6:06 ` Lu Baolu
  7 siblings, 0 replies; 34+ messages in thread
From: Lu Baolu @ 2026-03-09  6:06 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe
  Cc: Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel, Lu Baolu

Refactor intel_pasid_setup_nested() to use the intel_pasid_write()
helper. This moves the implementation onto the entry_sync infrastructure:
the nested PASID entry is constructed in a local buffer and committed
via the centralized intel_pasid_write() wrapper.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/nested.c | 13 +------------
 drivers/iommu/intel/pasid.c  | 27 +++++----------------------
 2 files changed, 6 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index 2b979bec56ce..1cebc1232f70 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -131,17 +131,6 @@ static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
 	return ret;
 }
 
-static int domain_setup_nested(struct intel_iommu *iommu,
-			       struct dmar_domain *domain,
-			       struct device *dev, ioasid_t pasid,
-			       struct iommu_domain *old)
-{
-	if (old)
-		intel_pasid_tear_down_entry(iommu, dev, pasid, false);
-
-	return intel_pasid_setup_nested(iommu, dev, pasid, domain);
-}
-
 static int intel_nested_set_dev_pasid(struct iommu_domain *domain,
 				      struct device *dev, ioasid_t pasid,
 				      struct iommu_domain *old)
@@ -170,7 +159,7 @@ static int intel_nested_set_dev_pasid(struct iommu_domain *domain,
 	if (ret)
 		goto out_remove_dev_pasid;
 
-	ret = domain_setup_nested(iommu, dmar_domain, dev, pasid, old);
+	ret = intel_pasid_setup_nested(iommu, dev, pasid, dmar_domain);
 	if (ret)
 		goto out_unwind_iopf;
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index cb55ff422d7d..5e0548dd8388 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -754,12 +754,8 @@ static void pasid_pte_config_nestd(struct intel_iommu *iommu,
 {
 	struct pt_iommu_vtdss_hw_info pt_info;
 
-	lockdep_assert_held(&iommu->lock);
-
 	pt_iommu_vtdss_hw_info(&s2_domain->sspt, &pt_info);
 
-	pasid_clear_entry(pte);
-
 	if (s1_cfg->addr_width == ADDR_WIDTH_5LEVEL)
 		pasid_set_flpm(pte, 1);
 
@@ -806,7 +802,9 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
 	struct iommu_hwpt_vtd_s1 *s1_cfg = &domain->s1_cfg;
 	struct dmar_domain *s2_domain = domain->s2_domain;
 	u16 did = domain_id_iommu(domain, iommu);
-	struct pasid_entry *pte;
+	struct pasid_entry new_pte = {0};
+
+	iommu_group_mutex_assert(dev);
 
 	/* Address width should match the address width supported by hardware */
 	switch (s1_cfg->addr_width) {
@@ -837,23 +835,8 @@ int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
 		return -EINVAL;
 	}
 
-	spin_lock(&iommu->lock);
-	pte = intel_pasid_get_entry(dev, pasid);
-	if (!pte) {
-		spin_unlock(&iommu->lock);
-		return -ENODEV;
-	}
-	if (pasid_pte_is_present(pte)) {
-		spin_unlock(&iommu->lock);
-		return -EBUSY;
-	}
-
-	pasid_pte_config_nestd(iommu, pte, s1_cfg, s2_domain, did);
-	spin_unlock(&iommu->lock);
-
-	pasid_flush_caches(iommu, pte, pasid, did);
-
-	return 0;
+	pasid_pte_config_nestd(iommu, &new_pte, s1_cfg, s2_domain, did);
+	return intel_pasid_write(iommu, dev, pasid, (u128 *)&new_pte);
 }
 
 /*
-- 
2.43.0



* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-09  6:06 ` [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates Lu Baolu
@ 2026-03-09 13:41   ` Jason Gunthorpe
  2026-03-11  8:42     ` Baolu Lu
  2026-03-12  7:50     ` Baolu Lu
  0 siblings, 2 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-09 13:41 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Mon, Mar 09, 2026 at 02:06:42PM +0800, Lu Baolu wrote:
> +static void intel_pasid_get_used(const u128 *entry, u128 *used)
> +{
> +	struct pasid_entry *pe = (struct pasid_entry *)entry;
> +	struct pasid_entry *ue = (struct pasid_entry *)used;
> +	u16 pgtt;
> +
> +	/* Initialize used bits to 0. */
> +	memset(ue, 0, sizeof(*ue));
> +
> +	/* Present bit always matters. */
> +	ue->val[0] |= PASID_PTE_PRESENT;
> +
> +	/* Nothing more for non-present entries. */
> +	if (!(pe->val[0] & PASID_PTE_PRESENT))
> +		return;
> +
> +	pgtt = pasid_pte_get_pgtt(pe);
> +	switch (pgtt) {
> +	case PASID_ENTRY_PGTT_FL_ONLY:
> +		/* AW, PGTT */
> +		ue->val[0] |= GENMASK_ULL(4, 2) | GENMASK_ULL(8, 6);
> +		/* DID, PWSNP, PGSNP */
> +		ue->val[1] |= GENMASK_ULL(24, 23) | GENMASK_ULL(15, 0);
> +		/* FSPTPTR, FSPM */
> +		ue->val[2] |= GENMASK_ULL(63, 12) | GENMASK_ULL(3, 2);

This would be an excellent time to properly add these constants :(

/* 9.6 Scalable-Mode PASID Table Entry */
#define SM_PASID0_P		BIT_U64(0)
#define SM_PASID0_FPD		BIT_U64(1)
#define SM_PASID0_AW		GENMASK_U64(4, 2)
#define SM_PASID0_SSEE		BIT_U64(5)
#define SM_PASID0_PGTT		GENMASK_U64(8, 6)
#define SM_PASID0_SSADE		BIT_U64(9)
#define SM_PASID0_SSPTPTR	GENMASK_U64(63, 12)

#define SM_PASID1_DID		GENMASK_U64(15, 0)
#define SM_PASID1_PWSNP		BIT_U64(23)
#define SM_PASID1_PGSNP		BIT_U64(24)
#define SM_PASID1_CD		BIT_U64(25)
#define SM_PASID1_EMTE		BIT_U64(26)
#define SM_PASID1_PAT		GENMASK_U64(63, 32)

#define SM_PASID2_SRE		BIT_U64(0)
#define SM_PASID2_ERE		BIT_U64(1)
#define SM_PASID2_FSPM		GENMASK_U64(3, 2)
#define SM_PASID2_WPE		BIT_U64(4)
#define SM_PASID2_NXE		BIT_U64(5)
#define SM_PASID2_SMEP		BIT_U64(6)
#define SM_PASID2_EAFE		BIT_U64(7)
#define SM_PASID2_FSPTPTR	GENMASK_U64(63, 12)
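
For what it's worth, the suggested masks can be sanity-checked outside the
kernel. BIT_U64()/GENMASK_U64() are defined locally below purely for
illustration (in-tree they come from <linux/bits.h>), and only the subset
needed by the FL_ONLY hunk is reproduced:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's bit helpers, for checking only. */
#define BIT_U64(n)		(1ULL << (n))
#define GENMASK_U64(h, l) \
	(((~0ULL) >> (63 - (h))) & ~((1ULL << (l)) - 1ULL))

/* 9.6 Scalable-Mode PASID Table Entry (subset used by FL_ONLY) */
#define SM_PASID0_AW		GENMASK_U64(4, 2)
#define SM_PASID0_PGTT		GENMASK_U64(8, 6)
#define SM_PASID1_DID		GENMASK_U64(15, 0)
#define SM_PASID1_PWSNP		BIT_U64(23)
#define SM_PASID1_PGSNP		BIT_U64(24)
#define SM_PASID2_FSPM		GENMASK_U64(3, 2)
#define SM_PASID2_FSPTPTR	GENMASK_U64(63, 12)

/*
 * With the names in place, the quoted hunk would read:
 *	ue->val[0] |= SM_PASID0_AW | SM_PASID0_PGTT;
 *	ue->val[1] |= SM_PASID1_DID | SM_PASID1_PWSNP | SM_PASID1_PGSNP;
 *	ue->val[2] |= SM_PASID2_FSPTPTR | SM_PASID2_FSPM;
 */
```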

> +static void intel_pasid_sync(struct entry_sync_writer128 *writer)
> +{
> +	struct intel_pasid_writer *p_writer = container_of(writer,
> +			struct intel_pasid_writer, writer);
> +	struct intel_iommu *iommu = p_writer->iommu;
> +	struct device *dev = p_writer->dev;
> +	bool was_present, is_present;
> +	u32 pasid = p_writer->pasid;
> +	struct pasid_entry *pte;
> +	u16 old_did, old_pgtt;
> +
> +	pte = intel_pasid_get_entry(dev, pasid);
> +	was_present = p_writer->was_present;
> +	is_present = pasid_pte_is_present(pte);
> +	old_did = pasid_get_domain_id(&p_writer->orig_pte);
> +	old_pgtt = pasid_pte_get_pgtt(&p_writer->orig_pte);
> +
> +	/* Update the last present state: */
> +	p_writer->was_present = is_present;
> +
> +	if (!ecap_coherent(iommu->ecap))
> +		clflush_cache_range(pte, sizeof(*pte));
> +
> +	/* Sync for "P=0" to "P=1": */
> +	if (!was_present) {
> +		if (is_present)
> +			pasid_flush_caches(iommu, pte, pasid,
> +					   pasid_get_domain_id(pte));
> +
> +		return;
> +	}
> +
> +	/* Sync for "P=1" to "P=1": */
> +	if (is_present) {
> +		intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
> +		return;
> +	}
> +
> +	/* Sync for "P=1" to "P=0": */
> +	pasid_cache_invalidation_with_pasid(iommu, old_did, pasid);

Why all this logic? All this different stuff does is meddle with the
IOTLB, and it should not be needed below.

If the sync is called it should just always call
pasid_cache_invalidation_with_pasid(), that's it.

Writer has already eliminated all cases where sync isn't needed.
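
(Roughly speaking, the writer's hitless determination counts how many
quanta differ in bits the hardware actually reads. A heavily simplified
sketch, reduced to 64-bit quanta and ignoring the real code's extra pass
that pre-writes ignored bits to shrink the critical set:)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An update can be committed with a single hitless atomic store when at
 * most one quanta changes in bits the hardware is sensitive to; changes
 * confined to ignored ("unused") bits never count as dirty. */
static bool update_is_hitless(const uint64_t *cur, const uint64_t *tgt,
			      const uint64_t *used, int nquanta)
{
	int dirty = 0;

	for (int i = 0; i < nquanta; i++)
		if ((cur[i] ^ tgt[i]) & used[i])
			dirty++;
	return dirty <= 1;	/* one atomic quanta store suffices */
}
```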

> +	if (old_pgtt == PASID_ENTRY_PGTT_PT || old_pgtt == PASID_ENTRY_PGTT_FL_ONLY)
> +		qi_flush_piotlb(iommu, old_did, pasid, 0, -1, 0);
> +	else
> +		iommu->flush.flush_iotlb(iommu, old_did, 0, 0, DMA_TLB_DSI_FLUSH);
> +	devtlb_invalidation_with_pasid(iommu, dev, pasid);

The IOTLB should already be cleaned before the new entry using the
cache tag is programmed. Cleaning it after the entry is live is buggy.

The writer logic ensures it never sees a corrupted entry, so the clean
cache tag cannot be mangled during the writing process.

The way ARM is structured, the cache tags are clean while they sit in
the allocator bitmap, so when the driver fetches a new tag and starts
using it, it is already clean and no cleaning is needed.

When it frees a tag it cleans it and then returns it to the allocator.

ATC invalidations should always be done after the PASID entry is
written. During a hitless update both translations are unpredictably
combined; this is unavoidable and OK.

Jason



* Re: [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support
  2026-03-09  6:06 ` [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support Lu Baolu
@ 2026-03-09 13:42   ` Jason Gunthorpe
  2026-03-12  7:59     ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-09 13:42 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Mon, Mar 09, 2026 at 02:06:43PM +0800, Lu Baolu wrote:
> The Intel IOMMU driver is moving toward using the generic entry_sync
> library for PASID table entry updates. This library requires 128-bit
> atomic write operations (cmpxchg128) to update 512-bit PASID entries in
> atomic quanta, ensuring the hardware never observes a torn entry.
> 
> On x86_64, 128-bit atomicity is provided by the CMPXCHG16B instruction.
> Update the driver to:
> 
> 1. Limit INTEL_IOMMU to X86_64, as 128-bit atomic operations are not
>    available on 32-bit x86.
> 2. Gate pasid_supported() on the presence of X86_FEATURE_CX16.
> 3. Provide a boot-time warning if a PASID-capable IOMMU is detected on
>    a CPU lacking the required instruction.

This is fine, but it also occurred to me that we could change the
writer somewhat to just detect what the update granule is and fall
back to 64 bit in this case. So everything still works, it just does
non-present a lot more often.

Jason


* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-09  6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
@ 2026-03-09 23:33   ` Samiullah Khawaja
  2026-03-10  0:06     ` Samiullah Khawaja
  2026-03-13  5:39   ` Nicolin Chen
  1 sibling, 1 reply; 34+ messages in thread
From: Samiullah Khawaja @ 2026-03-09 23:33 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>From: Jason Gunthorpe <jgg@nvidia.com>
>
>Many IOMMU implementations store data structures in host memory that can
>be quite big. The iommu is able to DMA read the host memory using an
>atomic quanta, usually 64 or 128 bits, and will read an entry using
>multiple quanta reads.
>
>Updating the host memory datastructure entry while the HW is concurrently
>DMA'ing it is a little bit involved, but if you want to do this hitlessly,
>while never making the entry non-valid, then it becomes quite complicated.
>
>entry_sync is a library to handle this task. It works on the notion of
>"used bits" which reflect which bits the HW is actually sensitive to and
>which bits are ignored by hardware. Many hardware specifications say
>things like 'if mode is X then bits ABC are ignored'.
>
>Using the ignored bits entry_sync can often compute a series of ordered
>writes and flushes that will allow the entry to be updated while keeping
>it valid. If such an update is not possible then entry will be made
>temporarily non-valid.
>
>A 64 and 128 bit quanta version is provided to support existing iommus.
>
>Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
>Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>---
> drivers/iommu/Kconfig               |  14 +++
> drivers/iommu/Makefile              |   1 +
> drivers/iommu/entry_sync.h          |  66 +++++++++++++
> drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
> drivers/iommu/entry_sync.c          |  68 +++++++++++++
> 5 files changed, 292 insertions(+)
> create mode 100644 drivers/iommu/entry_sync.h
> create mode 100644 drivers/iommu/entry_sync_template.h
> create mode 100644 drivers/iommu/entry_sync.c
>
>diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>index f86262b11416..2650c9fa125b 100644
>--- a/drivers/iommu/Kconfig
>+++ b/drivers/iommu/Kconfig
>@@ -145,6 +145,20 @@ config IOMMU_DEFAULT_PASSTHROUGH
>
> endchoice
>
>+config IOMMU_ENTRY_SYNC
>+	bool
>+	default n
>+
>+config IOMMU_ENTRY_SYNC64
>+	bool
>+	select IOMMU_ENTRY_SYNC
>+	default n
>+
>+config IOMMU_ENTRY_SYNC128
>+	bool
>+	select IOMMU_ENTRY_SYNC
>+	default n
>+
> config OF_IOMMU
> 	def_bool y
> 	depends on OF && IOMMU_API
>diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>index 0275821f4ef9..bd923995497a 100644
>--- a/drivers/iommu/Makefile
>+++ b/drivers/iommu/Makefile
>@@ -10,6 +10,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
> obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
> obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>+obj-$(CONFIG_IOMMU_ENTRY_SYNC) += entry_sync.o
> obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
> obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
> obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>diff --git a/drivers/iommu/entry_sync.h b/drivers/iommu/entry_sync.h
>new file mode 100644
>index 000000000000..004d421c71c0
>--- /dev/null
>+++ b/drivers/iommu/entry_sync.h
>@@ -0,0 +1,66 @@
>+/* SPDX-License-Identifier: GPL-2.0-only */
>+/*
>+ * Many IOMMU implementations store data structures in host memory that can be
>+ * quite big. The iommu is able to DMA read the host memory using an atomic
>+ * quantum, usually 64 or 128 bits, and will read an entry using multiple
>+ * quanta reads.
>+ *
>+ * Updating the host memory data structure entry while the HW is concurrently
>+ * DMA'ing it is a little bit involved, but if you want to do this hitlessly,
>+ * while never making the entry non-valid, then it becomes quite complicated.
>+ *
>+ * entry_sync is a library to handle this task. It works on the notion of "used
>+ * bits" which reflect which bits the HW is actually sensitive to and which bits
>+ * are ignored by hardware. Many hardware specifications say things like 'if
>+ * mode is X then bits ABC are ignored'.
>+ *
>+ * Using the ignored bits entry_sync can often compute a series of ordered
>+ * writes and flushes that will allow the entry to be updated while keeping it
>+ * valid. If such an update is not possible then entry will be made temporarily
>+ * non-valid.
>+ *
>+ * A 64 and 128 bit quanta version is provided to support existing iommus.
>+ */
>+#ifndef IOMMU_ENTRY_SYNC_H
>+#define IOMMU_ENTRY_SYNC_H
>+
>+#include <linux/types.h>
>+#include <linux/compiler.h>
>+#include <linux/bug.h>
>+
>+/* Caller allocates a stack array of this length to call entry_sync_write() */
>+#define ENTRY_SYNC_MEMORY_LEN(writer) ((writer)->num_quantas * 3)
>+
>+struct entry_sync_writer_ops64;
>+struct entry_sync_writer64 {
>+	const struct entry_sync_writer_ops64 *ops;
>+	size_t num_quantas;
>+	size_t vbit_quanta;
>+};
>+
>+struct entry_sync_writer_ops64 {
>+	void (*get_used)(const __le64 *entry, __le64 *used);
>+	void (*sync)(struct entry_sync_writer64 *writer);
>+};
>+
>+void entry_sync_write64(struct entry_sync_writer64 *writer, __le64 *entry,
>+			const __le64 *target, __le64 *memory,
>+			size_t memory_len);
>+
>+struct entry_sync_writer_ops128;
>+struct entry_sync_writer128 {
>+	const struct entry_sync_writer_ops128 *ops;
>+	size_t num_quantas;
>+	size_t vbit_quanta;
>+};
>+
>+struct entry_sync_writer_ops128 {
>+	void (*get_used)(const u128 *entry, u128 *used);
>+	void (*sync)(struct entry_sync_writer128 *writer);
>+};
>+
>+void entry_sync_write128(struct entry_sync_writer128 *writer, u128 *entry,
>+			 const u128 *target, u128 *memory,
>+			 size_t memory_len);
>+
>+#endif
>diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
>new file mode 100644
>index 000000000000..646f518b098e
>--- /dev/null
>+++ b/drivers/iommu/entry_sync_template.h
>@@ -0,0 +1,143 @@
>+/* SPDX-License-Identifier: GPL-2.0-only */
>+#include "entry_sync.h"
>+#include <linux/args.h>
>+#include <linux/bitops.h>
>+
>+#ifndef entry_sync_writer
>+#define entry_sync_writer entry_sync_writer64
>+#define quanta_t __le64
>+#define NS(name) CONCATENATE(name, 64)
>+#endif
>+
>+/*
>+ * Figure out if we can do a hitless update of entry to become target. Returns a
>+ * bit mask where 1 indicates that a quanta word needs to be set disruptively.
>+ * unused_update is an intermediate value of entry that has unused bits set to
>+ * their new values.
>+ */
>+static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
>+				const quanta_t *entry, const quanta_t *target,
>+				quanta_t *unused_update, quanta_t *memory)
>+{
>+	quanta_t *target_used = memory + writer->num_quantas * 1;
>+	quanta_t *cur_used = memory + writer->num_quantas * 2;
>+	u8 used_qword_diff = 0;
>+	unsigned int i;
>+
>+	writer->ops->get_used(entry, cur_used);
>+	writer->ops->get_used(target, target_used);
>+
>+	for (i = 0; i != writer->num_quantas; i++) {
>+		/*
>+		 * Check that masks are up to date, the make functions are not

nit: "the make functions" looks like a typo.
>+		 * allowed to set a bit to 1 if the used function doesn't say it
>+		 * is used.
>+		 */
>+		WARN_ON_ONCE(target[i] & ~target_used[i]);
>+
>+		/* Bits can change because they are not currently being used */
>+		unused_update[i] = (entry[i] & cur_used[i]) |
>+				   (target[i] & ~cur_used[i]);
>+		/*
>+		 * Each bit indicates that a used bit in a qword needs to be
>+		 * changed after unused_update is applied.
>+		 */
>+		if ((unused_update[i] & target_used[i]) != target[i])
>+			used_qword_diff |= 1 << i;
>+	}
>+	return used_qword_diff;
>+}
>+
>+/*
>+ * Update the entry to the target configuration. The transition from the current
>+ * entry to the target entry takes place over multiple steps that attempt to
>+ * make the transition hitless if possible. This function takes care not to
>+ * create a situation where the HW can perceive a corrupted entry. HW is only
>+ * required to provide quanta-sized atomicity for stores from the CPU, while
>+ * entries are many quanta big.
>+ *
>+ * The difference between the current value and the target value is analyzed to
>+ * determine which of three updates are required - disruptive, hitless or no
>+ * change.
>+ *
>+ * In the most general disruptive case we can make any update in three steps:
>+ *  - Disrupting the entry (V=0)
>+ *  - Fill now unused quanta words, except qword 0 which contains V
>+ *  - Make qword 0 have the final value and valid (V=1) with a single 64
>+ *    bit store
>+ *
>+ * However this disrupts the HW while it is happening. There are several
>+ * interesting cases where a STE/CD can be updated without disturbing the HW
>+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
>+ * because the used bits don't intersect. We can detect this by calculating how
>+ * many 64 bit values need update after adjusting the unused bits and skip the
>+ * V=0 process. This relies on the IGNORED behavior described in the
>+ * specification.
>+ */
>+void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
>+			  const quanta_t *target, quanta_t *memory,
>+			  size_t memory_len)
>+{
>+	quanta_t *unused_update = memory + writer->num_quantas * 0;
>+	u8 used_qword_diff;
>+
>+	if (WARN_ON(memory_len !=
>+		    ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
>+		return;
>+
>+	used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
>+						unused_update, memory);
>+	if (hweight8(used_qword_diff) == 1) {
>+		/*
>+		 * Only one quanta needs its used bits to be changed. This is a
>+		 * hitless update, update all bits the current entry is ignoring
>+		 * to their new values, then update a single "critical quanta"
>+		 * to change the entry and finally 0 out any bits that are now
>+		 * unused in the target configuration.
>+		 */
>+		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
>+
>+		/*
>+		 * Skip writing unused bits in the critical quanta since we'll
>+		 * be writing it in the next step anyways. This can save a sync
>+		 * when the only change is in that quanta.
>+		 */
>+		unused_update[critical_qword_index] =
>+			entry[critical_qword_index];
>+		NS(entry_set)(writer, entry, unused_update, 0,
>+			      writer->num_quantas);
>+		NS(entry_set)(writer, entry, target, critical_qword_index, 1);
>+		NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
>+	} else if (used_qword_diff) {
>+		/*
>+		 * At least two quantas need their inuse bits to be changed.
>+		 * This requires a breaking update, zero the V bit, write all
>+		 * qwords but 0, then set qword 0
>+		 */
>+		unused_update[writer->vbit_quanta] = 0;
>+		NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
>+
>+		if (writer->vbit_quanta != 0)
>+			NS(entry_set)(writer, entry, target, 0,
>+				      writer->vbit_quanta - 1);

Looking at the definition of the entry_set below, the last argument is
length. So if vbit_quanta 1 then it would write zero len. Shouldn't it
be writing quantas before the vbit_quanta?
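To make that concrete, here is a tiny stand-alone model of just the entry_set() loop bounds (written for illustration, not taken from the patch; the helper name is made up). With vbit_quanta == 1, the posted "len = vbit_quanta - 1" call writes zero quanta, while passing vbit_quanta as the length would cover the qwords before the V qword as presumably intended:

```c
#include <assert.h>

/*
 * Illustrative model of the entry_set() loop bounds only: returns a
 * bitmask of the quanta indices in [start, start + len) that the loop
 * would write.
 */
static unsigned int written_mask(unsigned int start, unsigned int len)
{
	unsigned int mask = 0, i;

	for (i = start; len != 0; len--, i++)
		mask |= 1u << i;
	return mask;
}
```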
>+		if (writer->vbit_quanta != writer->num_quantas)
>+			NS(entry_set)(writer, entry, target,
>+				      writer->vbit_quanta,
>+				      writer->num_quantas - 1);

Sami here, the last argument should not have "- 1".
>+
>+		NS(entry_set)(writer, entry, target, writer->vbit_quanta, 1);
>+	} else {
>+		/*
>+		 * No inuse bit changed. Sanity check that all unused bits are 0
>+		 * in the entry. The target was already sanity checked by
>+		 * entry_quanta_diff().
>+		 */
>+		WARN_ON_ONCE(NS(entry_set)(writer, entry, target, 0,
>+					   writer->num_quantas));
>+	}
>+}
>+EXPORT_SYMBOL(NS(entry_sync_write));
>+
>+#undef entry_sync_writer
>+#undef quanta_t
>+#undef NS
>diff --git a/drivers/iommu/entry_sync.c b/drivers/iommu/entry_sync.c
>new file mode 100644
>index 000000000000..48d31270dbba
>--- /dev/null
>+++ b/drivers/iommu/entry_sync.c
>@@ -0,0 +1,68 @@
>+// SPDX-License-Identifier: GPL-2.0-only
>+/*
>+ * Helpers for drivers to update multi-quanta entries shared with HW without
>+ * races to minimize breaking changes.
>+ */
>+#include "entry_sync.h"
>+#include <linux/kconfig.h>
>+#include <linux/atomic.h>
>+
>+#if IS_ENABLED(CONFIG_IOMMU_ENTRY_SYNC64)
>+static bool entry_set64(struct entry_sync_writer64 *writer, __le64 *entry,
>+			const __le64 *target, unsigned int start,
>+			unsigned int len)
>+{
>+	bool changed = false;
>+	unsigned int i;
>+
>+	for (i = start; len != 0; len--, i++) {
>+		if (entry[i] != target[i]) {
>+			WRITE_ONCE(entry[i], target[i]);
>+			changed = true;
>+		}
>+	}
>+
>+	if (changed)
>+		writer->ops->sync(writer);
>+	return changed;
>+}
>+
>+#define entry_sync_writer entry_sync_writer64
>+#define quanta_t __le64
>+#define NS(name) CONCATENATE(name, 64)
>+#include "entry_sync_template.h"
>+#endif
>+
>+#if IS_ENABLED(CONFIG_IOMMU_ENTRY_SYNC128)
>+static bool entry_set128(struct entry_sync_writer128 *writer, u128 *entry,
>+			 const u128 *target, unsigned int start,
>+			 unsigned int len)
>+{
>+	bool changed = false;
>+	unsigned int i;
>+
>+	for (i = start; len != 0; len--, i++) {
>+		if (entry[i] != target[i]) {
>+			/*
>+			 * Use cmpxchg128 to generate an indivisible write from
>+			 * the CPU to DMA'able memory. This must ensure that HW
>+			 * sees either the new or old 128 bit value and not
>+			 * something torn. As updates are serialized by a
>+			 * spinlock, we use the local (unlocked) variant to
>+			 * avoid unnecessary bus locking overhead.
>+			 */
>+			cmpxchg128_local(&entry[i], entry[i], target[i]);
>+			changed = true;
>+		}
>+	}
>+
>+	if (changed)
>+		writer->ops->sync(writer);
>+	return changed;
>+}
>+
>+#define entry_sync_writer entry_sync_writer128
>+#define quanta_t u128
>+#define NS(name) CONCATENATE(name, 128)
>+#include "entry_sync_template.h"
>+#endif
>-- 
>2.43.0
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-09 23:33   ` Samiullah Khawaja
@ 2026-03-10  0:06     ` Samiullah Khawaja
  2026-03-14  8:13       ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Samiullah Khawaja @ 2026-03-10  0:06 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
>On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
[...]
>>+	} else if (used_qword_diff) {
>>+		/*
>>+		 * At least two quantas need their inuse bits to be changed.
>>+		 * This requires a breaking update, zero the V bit, write all
>>+		 * qwords but 0, then set qword 0
>>+		 */
>>+		unused_update[writer->vbit_quanta] = 0;
>>+		NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
>>+
>>+		if (writer->vbit_quanta != 0)
>>+			NS(entry_set)(writer, entry, target, 0,
>>+				      writer->vbit_quanta - 1);
>
>Looking at the definition of the entry_set below, the last argument is
>length. So if vbit_quanta 1 then it would write zero len. Shouldn't it
>be writing quantas before the vbit_quanta?
>>+		if (writer->vbit_quanta != writer->num_quantas)

Looking at this again, I think vbit_quanta can never be equal to
num_quantas, since num_quantas is a length and vbit_quanta is an index?
>>+			NS(entry_set)(writer, entry, target,
>>+				      writer->vbit_quanta,

Starting from vbit_quanta will set the present bit if it is set in the
target?
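For what it's worth, a small stand-alone simulation of the posted sequence (all names and values below are illustrative, not from the patch) shows this: with vbit_quanta == 0, the entry_set(..., vbit_quanta, num_quantas - 1) step rewrites the V qword from target while a later qword still holds its old value, so HW could observe V=1 on a half-updated entry:

```c
#include <assert.h>
#include <string.h>

#define NQ   4           /* num_quantas in this example */
#define VBIT 0x1ULL      /* V bit lives in qword 0 here (vbit_quanta == 0) */

/*
 * Illustrative simulation of the posted breaking-update sequence.
 * Returns 1 if, after any single store, HW could read V=1 while some
 * qword still holds its stale old value.
 */
static int torn_window(void)
{
	unsigned long long entry[NQ]  = { VBIT | 0x10, 0x20, 0x30, 0x40 };
	unsigned long long target[NQ] = { VBIT | 0x1000, 0x2000, 0x3000, 0x4000 };
	unsigned long long old[NQ];
	int torn = 0, q, i;

	memcpy(old, entry, sizeof(old));

	/* Step 1: zero the V qword */
	entry[0] = 0;

	/* Step 2: the posted call writes qwords [vbit_quanta, NQ - 1) */
	for (q = 0; q < NQ - 1; q++) {
		entry[q] = target[q];
		if (entry[0] & VBIT) {
			for (i = 0; i < NQ; i++)
				if (entry[i] == old[i] && old[i] != target[i])
					torn = 1;
		}
	}

	/* Step 3: final store of the V qword */
	entry[0] = target[0];
	return torn;
}
```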
>>+				      writer->num_quantas - 1);
>
>Sami here, the last argument should not have "- 1".

I meant "Same here".
[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-09 13:41   ` Jason Gunthorpe
@ 2026-03-11  8:42     ` Baolu Lu
  2026-03-11 12:23       ` Jason Gunthorpe
  2026-03-12  7:50     ` Baolu Lu
  1 sibling, 1 reply; 34+ messages in thread
From: Baolu Lu @ 2026-03-11  8:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/9/26 21:41, Jason Gunthorpe wrote:
> On Mon, Mar 09, 2026 at 02:06:42PM +0800, Lu Baolu wrote:
>> +static void intel_pasid_get_used(const u128 *entry, u128 *used)
>> +{
>> +	struct pasid_entry *pe = (struct pasid_entry *)entry;
>> +	struct pasid_entry *ue = (struct pasid_entry *)used;
>> +	u16 pgtt;
>> +
>> +	/* Initialize used bits to 0. */
>> +	memset(ue, 0, sizeof(*ue));
>> +
>> +	/* Present bit always matters. */
>> +	ue->val[0] |= PASID_PTE_PRESENT;
>> +
>> +	/* Nothing more for non-present entries. */
>> +	if (!(pe->val[0] & PASID_PTE_PRESENT))
>> +		return;
>> +
>> +	pgtt = pasid_pte_get_pgtt(pe);
>> +	switch (pgtt) {
>> +	case PASID_ENTRY_PGTT_FL_ONLY:
>> +		/* AW, PGTT */
>> +		ue->val[0] |= GENMASK_ULL(4, 2) | GENMASK_ULL(8, 6);
>> +		/* DID, PWSNP, PGSNP */
>> +		ue->val[1] |= GENMASK_ULL(24, 23) | GENMASK_ULL(15, 0);
>> +		/* FSPTPTR, FSPM */
>> +		ue->val[2] |= GENMASK_ULL(63, 12) | GENMASK_ULL(3, 2);
> This would be an excellent time to properly add these constants 🙁
> 
> /* 9.6 Scalable-Mode PASID Table Entry */
> #define SM_PASID0_P		BIT_U64(0)
> #define SM_PASID0_FPD		BIT_U64(1)
> #define SM_PASID0_AW		GENMASK_U64(4, 2)
> #define SM_PASID0_SSEE		BIT_U64(5)
> #define SM_PASID0_PGTT		GENMASK_U64(8, 6)
> #define SM_PASID0_SSADE		BIT_U64(9)
> #define SM_PASID0_SSPTPTR	GENMASK_U64(63, 12)
> 
> #define SM_PASID1_DID		GENMASK_U64(15, 0)
> #define SM_PASID1_PWSNP		BIT_U64(23)
> #define SM_PASID1_PGSNP		BIT_U64(24)
> #define SM_PASID1_CD		BIT_U64(25)
> #define SM_PASID1_EMTE		BIT_U64(26)
> #define SM_PASID1_PAT		GENMASK_U64(63, 32)
> 
> #define SM_PASID2_SRE		BIT_U64(0)
> #define SM_PASID2_ERE		BIT_U64(1)
> #define SM_PASID2_FSPM		GENMASK_U64(3, 2)
> #define SM_PASID2_WPE		BIT_U64(4)
> #define SM_PASID2_NXE		BIT_U64(5)
> #define SM_PASID2_SMEP		BIT_U64(6)
> #define SM_PASID2_EAFE		BIT_U64(7)
> #define SM_PASID2_FSPTPTR	GENMASK_U64(63, 12)

Yeah, code updated like this,

drivers/iommu/intel/pasid.h:

/* 9.6 Scalable-Mode PASID Table Entry */
#define SM_PASID0_P             BIT_U64(0)
#define SM_PASID0_FPD           BIT_U64(1)
#define SM_PASID0_AW            GENMASK_U64(4, 2)
#define SM_PASID0_PGTT          GENMASK_U64(8, 6)
#define SM_PASID0_SSADE         BIT_U64(9)
#define SM_PASID0_SSPTPTR       GENMASK_U64(63, 12)

#define SM_PASID1_DID           GENMASK_U64(15, 0)
#define SM_PASID1_PWSNP         BIT_U64(23)
#define SM_PASID1_PGSNP         BIT_U64(24)
#define SM_PASID1_CD            BIT_U64(25)
#define SM_PASID1_EMTE          BIT_U64(26)
#define SM_PASID1_PAT           GENMASK_U64(63, 32)

#define SM_PASID2_SRE           BIT_U64(0)
#define SM_PASID2_FSPM          GENMASK_U64(3, 2)
#define SM_PASID2_WPE           BIT_U64(4)
#define SM_PASID2_EAFE          BIT_U64(7)
#define SM_PASID2_FSPTPTR       GENMASK_U64(63, 12)

drivers/iommu/intel/pasid.c:

static void intel_pasid_get_used(const u128 *entry, u128 *used)
{
	struct pasid_entry *pe = (struct pasid_entry *)entry;
	struct pasid_entry *ue = (struct pasid_entry *)used;
	u16 pgtt;

	/* Initialize used bits to 0. */
	memset(ue, 0, sizeof(*ue));

	/* Present bit always matters. */
	ue->val[0] |= SM_PASID0_P;

	/* Nothing more for non-present entries. */
	if (!(pe->val[0] & SM_PASID0_P)) {
		trace_entry_get_used(entry, used);
		return;
	}

	pgtt = pasid_pte_get_pgtt(pe);
	switch (pgtt) {
	case PASID_ENTRY_PGTT_FL_ONLY:
		/* AW, PGTT */
		ue->val[0] |= SM_PASID0_AW | SM_PASID0_PGTT;
		/* DID, PWSNP, PGSNP */
		ue->val[1] |= SM_PASID1_DID | SM_PASID1_PWSNP | SM_PASID1_PGSNP;
		/* FSPTPTR, FSPM */
		ue->val[2] |= SM_PASID2_FSPTPTR | SM_PASID2_FSPM;
		break;
	case PASID_ENTRY_PGTT_NESTED:
		/* FPD, AW, PGTT, SSADE, SSPTPTR */
		ue->val[0] |= SM_PASID0_FPD | SM_PASID0_AW | SM_PASID0_PGTT |
			      SM_PASID0_SSADE | SM_PASID0_SSPTPTR;
		/* DID, PWSNP, PGSNP */
		ue->val[1] |= SM_PASID1_DID | SM_PASID1_PWSNP | SM_PASID1_PGSNP;
		/* SRE, WPE, EAFE, FSPM, FSPTPTR */
		ue->val[2] |= SM_PASID2_SRE | SM_PASID2_WPE | SM_PASID2_EAFE |
			      SM_PASID2_FSPM | SM_PASID2_FSPTPTR;
		break;
	case PASID_ENTRY_PGTT_SL_ONLY:
		/* FPD, AW, PGTT, SSADE, SSPTPTR */
		ue->val[0] |= SM_PASID0_FPD | SM_PASID0_AW | SM_PASID0_PGTT |
			      SM_PASID0_SSADE | SM_PASID0_SSPTPTR;
		/* DID, PWSNP, PGSNP */
		ue->val[1] |= SM_PASID1_DID | SM_PASID1_PWSNP | SM_PASID1_PGSNP;
		break;
	case PASID_ENTRY_PGTT_PT:
		/* FPD, AW, PGTT */
		ue->val[0] |= SM_PASID0_FPD | SM_PASID0_AW | SM_PASID0_PGTT;
		/* DID, PWSNP, PGSNP */
		ue->val[1] |= SM_PASID1_DID | SM_PASID1_PWSNP | SM_PASID1_PGSNP;
		break;
	default:
		WARN_ON(true);
	}

	trace_entry_get_used(entry, used);
}

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-11  8:42     ` Baolu Lu
@ 2026-03-11 12:23       ` Jason Gunthorpe
  2026-03-12  7:51         ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-11 12:23 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Wed, Mar 11, 2026 at 04:42:37PM +0800, Baolu Lu wrote:
>         switch (pgtt) {
>         case PASID_ENTRY_PGTT_FL_ONLY:
>                 /* AW, PGTT */
>                 ue->val[0] |= SM_PASID0_AW | SM_PASID0_PGTT;

Probably don't need the comments too :)

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-09 13:41   ` Jason Gunthorpe
  2026-03-11  8:42     ` Baolu Lu
@ 2026-03-12  7:50     ` Baolu Lu
  2026-03-12 11:44       ` Jason Gunthorpe
  1 sibling, 1 reply; 34+ messages in thread
From: Baolu Lu @ 2026-03-12  7:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/9/26 21:41, Jason Gunthorpe wrote:
>> +static void intel_pasid_sync(struct entry_sync_writer128 *writer)
>> +{
>> +	struct intel_pasid_writer *p_writer = container_of(writer,
>> +			struct intel_pasid_writer, writer);
>> +	struct intel_iommu *iommu = p_writer->iommu;
>> +	struct device *dev = p_writer->dev;
>> +	bool was_present, is_present;
>> +	u32 pasid = p_writer->pasid;
>> +	struct pasid_entry *pte;
>> +	u16 old_did, old_pgtt;
>> +
>> +	pte = intel_pasid_get_entry(dev, pasid);
>> +	was_present = p_writer->was_present;
>> +	is_present = pasid_pte_is_present(pte);
>> +	old_did = pasid_get_domain_id(&p_writer->orig_pte);
>> +	old_pgtt = pasid_pte_get_pgtt(&p_writer->orig_pte);
>> +
>> +	/* Update the last present state: */
>> +	p_writer->was_present = is_present;
>> +
>> +	if (!ecap_coherent(iommu->ecap))
>> +		clflush_cache_range(pte, sizeof(*pte));
>> +
>> +	/* Sync for "P=0" to "P=1": */
>> +	if (!was_present) {
>> +		if (is_present)
>> +			pasid_flush_caches(iommu, pte, pasid,
>> +					   pasid_get_domain_id(pte));
>> +
>> +		return;
>> +	}
>> +
>> +	/* Sync for "P=1" to "P=1": */
>> +	if (is_present) {
>> +		intel_pasid_flush_present(iommu, dev, pasid, old_did, pte);
>> +		return;
>> +	}
>> +
>> +	/* Sync for "P=1" to "P=0": */
>> +	pasid_cache_invalidation_with_pasid(iommu, old_did, pasid);
> Why all this logic? All this different stuff does is meddle with the
> IOTLB and it should not be seen below.
> 
> If the sync is called it should just always call
> pasid_cache_invalidation_with_pasid(), that's it.
> 
> Writer has already eliminated all cases where sync isn't needed.

You're right. The library should simplify things. I will remove the
state tracking. The callback will only ensure that memory is flushed
(for non-coherent mode) and the relevant PASID cache is invalidated.

> 
>> +	if (old_pgtt == PASID_ENTRY_PGTT_PT || old_pgtt == PASID_ENTRY_PGTT_FL_ONLY)
>> +		qi_flush_piotlb(iommu, old_did, pasid, 0, -1, 0);
>> +	else
>> +		iommu->flush.flush_iotlb(iommu, old_did, 0, 0, DMA_TLB_DSI_FLUSH);
>> +	devtlb_invalidation_with_pasid(iommu, dev, pasid);
> The IOTLB should already be cleaned before the new entry using the
> cache tag is programmed. Cleaning it after the entry is live is buggy.
> 
> The writer logic ensures it never sees a corrupted entry, so the clean
> cache tag cannot be mangled during the writing process.
> 
> The way ARM is structured has the cache tags clean if they are in the
> allocator bitmap, so when the driver fetches a new tag and starts
> using it, it is clean and no cleaning is needed.
> 
> When it frees a tag it cleans it and then returns it to the allocator.

If I understand your remark correctly, the driver should only need the
following in the sync callback:

- clflush (if non-coherent) to ensure the entry is in physical memory.
- PASID cache invalidation to force the hardware to re-read the entry.
- Device-TLB invalidation to drop local device caches.

Does that sound right? I can move the general IOTLB/PIOTLB invalidation
logic to the domain detach/free paths.
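
For concreteness, the slimmed-down callback could be shaped like the
userspace mock below. This is illustrative only, not driver code: the
stub helpers merely record the order of the two remaining steps,
standing in for clflush_cache_range() and the PASID cache invalidation.

```c
#include <assert.h>

/* Userspace mock: records the order of the invalidation steps so the
 * "clflush (if non-coherent) -> PASID cache invalidation" sequence
 * can be checked. All names are illustrative, not the driver's API. */
enum sync_step { STEP_CLFLUSH = 1, STEP_PASID_CACHE = 2 };

static enum sync_step log_buf[4];
static int log_len;

static void record(enum sync_step s)
{
	log_buf[log_len++] = s;
}

struct mock_writer {
	int coherent;	/* models ecap_coherent(iommu->ecap) */
};

/* The simplified sync callback: push the entry to physical memory if
 * the IOMMU is not cache coherent, then invalidate the PASID cache.
 * No present/not-present state tracking, no IOTLB logic here. */
static void mock_pasid_sync(struct mock_writer *w)
{
	if (!w->coherent)
		record(STEP_CLFLUSH);		/* clflush_cache_range() stand-in */
	record(STEP_PASID_CACHE);		/* PASID cache invalidation stand-in */
}
```

The device-TLB flush would then be issued by the caller once, after the
writer returns, outside this callback.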

> ATC invalidations should always be done after the PASID entry is
> written. During a hitless update both translations are unpredictably
> combined, this is unavoidable and OK.

The VT-d spec (Sections 6.5.2.5 and 6.5.2.6) explicitly mandates that an
IOTLB invalidation must precede the Device-TLB invalidation. If we only
do the device-TLB invalidation in the sync callback, we risk the device
re-fetching a stale translation from the IOMMU's internal IOTLB.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-11 12:23       ` Jason Gunthorpe
@ 2026-03-12  7:51         ` Baolu Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Baolu Lu @ 2026-03-12  7:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/11/26 20:23, Jason Gunthorpe wrote:
> On Wed, Mar 11, 2026 at 04:42:37PM +0800, Baolu Lu wrote:
>>          switch (pgtt) {
>>          case PASID_ENTRY_PGTT_FL_ONLY:
>>                  /* AW, PGTT */
>>                  ue->val[0] |= SM_PASID0_AW | SM_PASID0_PGTT;
> Probably don't need the comments too 🙂

Yeah, the code itself is self-explanatory.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support
  2026-03-09 13:42   ` Jason Gunthorpe
@ 2026-03-12  7:59     ` Baolu Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Baolu Lu @ 2026-03-12  7:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/9/26 21:42, Jason Gunthorpe wrote:
> On Mon, Mar 09, 2026 at 02:06:43PM +0800, Lu Baolu wrote:
>> The Intel IOMMU driver is moving toward using the generic entry_sync
>> library for PASID table entry updates. This library requires 128-bit
>> atomic write operations (cmpxchg128) to update 512-bit PASID entries in
>> atomic quanta, ensuring the hardware never observes a torn entry.
>>
>> On x86_64, 128-bit atomicity is provided by the CMPXCHG16B instruction.
>> Update the driver to:
>>
>> 1. Limit INTEL_IOMMU to X86_64, as 128-bit atomic operations are not
>>     available on 32-bit x86.
>> 2. Gate pasid_supported() on the presence of X86_FEATURE_CX16.
>> 3. Provide a boot-time warning if a PASID-capable IOMMU is detected on
>>     a CPU lacking the required instruction.
> This is fine, but it also occurred to me that we could change the
> writer somewhat to just detect what the update granule is and fall
> back to 64 bit in this case. So everything still works, it just does
> non-present a lot more often.

That's a good point. Though I don't expect many real-world use cases for
PASID on platforms lacking CX16, making the entry_sync library and the
driver adaptive would make the infrastructure more robust. I will look
into supporting a 64-bit fallback.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-12  7:50     ` Baolu Lu
@ 2026-03-12 11:44       ` Jason Gunthorpe
  2026-03-15  8:11         ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-12 11:44 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Thu, Mar 12, 2026 at 03:50:03PM +0800, Baolu Lu wrote:
> If I understand your remark correctly, the driver should only need the
> following in the sync callback:
> 
> - clflush (if non-coherent) to ensure the entry is in physical memory.
> - PASID cache invalidation to force the hardware to re-read the entry.

Yes

> - Device-TLB invalidation to drop local device caches.

I have prefered to keep this outside the entry_set system since it has
nothing to do with updating the context entry.

There should be only one ATS flush after the new entry is installed.

> > ATC invalidations should always be done after the PASID entry is
> > written. During a hitless update both translations are unpredictably
> > combined, this is unavoidable and OK.
> 
> The VT-d spec (Sections 6.5.2.5 and 6.5.2.6) explicitly mandates that an
> IOTLB invalidation must precede the Device-TLB invalidation. If we only
> do the device-TLB invalidation in the sync callback, we risk the device
> re-fetching a stale translation from the IOMMU's internal IOTLB.

It is a little weird that it says that; that is worth checking into.

The other text is clear that the IOTLB is cached by DID,PASID only, so
if the new PASID entry has a DID,PASID which is already coherent in
the IOTLB it should not need any IOTLB flushing.

ie flushing the PASID table should immediately change any ATC fetches
from using DID,old_PASID to DID,new_PASID.

If there is some issue where the PASID flush doesn't fence everything
(ie an ATC fetch of DID,old_PASID can be passed by an ATC invalidation)
then you may need IOTLB invalidations not to manage coherence but to
manage ordering. That is an important detail if true.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-09  6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
  2026-03-09 23:33   ` Samiullah Khawaja
@ 2026-03-13  5:39   ` Nicolin Chen
  2026-03-16  6:24     ` Baolu Lu
  1 sibling, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2026-03-13  5:39 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, Samiullah Khawaja, iommu,
	linux-kernel

Hi Baolu,

On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
> +struct entry_sync_writer_ops64;
> +struct entry_sync_writer64 {
> +	const struct entry_sync_writer_ops64 *ops;
> +	size_t num_quantas;
> +	size_t vbit_quanta;
> +};

Though I could guess what the @num_quantas and @vbit_quanta likely
mean, it'd be nicer to have some notes elaborating them.

> +/*
> + * Figure out if we can do a hitless update of entry to become target. Returns a
> + * bit mask where 1 indicates that a quanta word needs to be set disruptively.
> + * unused_update is an intermediate value of entry that has unused bits set to
> + * their new values.
> + */
> +static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
> +				const quanta_t *entry, const quanta_t *target,
> +				quanta_t *unused_update, quanta_t *memory)
> +{
> +	quanta_t *target_used = memory + writer->num_quantas * 1;
> +	quanta_t *cur_used = memory + writer->num_quantas * 2;

Should we have a kdoc somewhere mentioning that the two arrays are
neighbors (IIUIC)?

> +	u8 used_qword_diff = 0;

It seems to me that we want to use "quanta" vs. "qword"? 128 bits can
be called "dqword" as well though.

> +	unsigned int i;
> +
> +	writer->ops->get_used(entry, cur_used);
> +	writer->ops->get_used(target, target_used);

SMMU has get_update_safe now. Can we take it together?

> +void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
> +			  const quanta_t *target, quanta_t *memory,
> +			  size_t memory_len)
> +{
> +	quanta_t *unused_update = memory + writer->num_quantas * 0;
> +	u8 used_qword_diff;
> +
> +	if (WARN_ON(memory_len !=
> +		    ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
> +		return;
> +
> +	used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
> +						unused_update, memory);
> +	if (hweight8(used_qword_diff) == 1) {
> +		/*
> +		 * Only one quanta needs its used bits to be changed. This is a
> +		 * hitless update, update all bits the current entry is ignoring
> +		 * to their new values, then update a single "critical quanta"
> +		 * to change the entry and finally 0 out any bits that are now
> +		 * unused in the target configuration.
> +		 */
> +		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> +
> +		/*
> +		 * Skip writing unused bits in the critical quanta since we'll
> +		 * be writing it in the next step anyways. This can save a sync
> +		 * when the only change is in that quanta.
> +		 */
> +		unused_update[critical_qword_index] =
> +			entry[critical_qword_index];
> +		NS(entry_set)(writer, entry, unused_update, 0,
> +			      writer->num_quantas);
> +		NS(entry_set)(writer, entry, target, critical_qword_index, 1);
> +		NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
> +	} else if (used_qword_diff) {
> +		/*
> +		 * At least two quantas need their inuse bits to be changed.
> +		 * This requires a breaking update, zero the V bit, write all
> +		 * qwords but 0, then set qword 0
> +		 */

Still, it'd be nicer to unify the wording between "quanta" and
"qword".

[..]
> +EXPORT_SYMBOL(NS(entry_sync_write));

There is also a KUNIT test coverage in arm-smmu-v3 for all of these
functions. Maybe we can make that generic as well?

> +#define entry_sync_writer entry_sync_writer64
> +#define quanta_t __le64
[..]
> +#define entry_sync_writer entry_sync_writer128
> +#define quanta_t u128

u64 can be called 64 too, though we might not have use case for now.

But maybe we could just call them:
    entry_sync_writer_le64
    entry_sync_writer_u128
?

Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-10  0:06     ` Samiullah Khawaja
@ 2026-03-14  8:13       ` Baolu Lu
  2026-03-16  9:51         ` Will Deacon
  2026-03-16 16:35         ` Samiullah Khawaja
  0 siblings, 2 replies; 34+ messages in thread
From: Baolu Lu @ 2026-03-14  8:13 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On 3/10/26 08:06, Samiullah Khawaja wrote:
> On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
>> On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>
>>> Many IOMMU implementations store data structures in host memory that can
>>> be quite big. The iommu is able to DMA read the host memory using an
>>> atomic quanta, usually 64 or 128 bits, and will read an entry using
>>> multiple quanta reads.
>>>
>>> Updating the host memory datastructure entry while the HW is concurrently
>>> DMA'ing it is a little bit involved, but if you want to do this hitlessly,
>>> while never making the entry non-valid, then it becomes quite complicated.
>>>
>>> entry_sync is a library to handle this task. It works on the notion of
>>> "used bits" which reflect which bits the HW is actually sensitive to and
>>> which bits are ignored by hardware. Many hardware specifications say
>>> things like 'if mode is X then bits ABC are ignored'.
>>>
>>> Using the ignored bits entry_sync can often compute a series of ordered
>>> writes and flushes that will allow the entry to be updated while keeping
>>> it valid. If such an update is not possible then entry will be made
>>> temporarily non-valid.
>>>
>>> A 64 and 128 bit quanta version is provided to support existing iommus.
>>>
>>> Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
>>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>> drivers/iommu/Kconfig               |  14 +++
>>> drivers/iommu/Makefile              |   1 +
>>> drivers/iommu/entry_sync.h          |  66 +++++++++++++
>>> drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
>>> drivers/iommu/entry_sync.c          |  68 +++++++++++++
>>> 5 files changed, 292 insertions(+)
>>> create mode 100644 drivers/iommu/entry_sync.h
>>> create mode 100644 drivers/iommu/entry_sync_template.h
>>> create mode 100644 drivers/iommu/entry_sync.c
>>>
>>> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>>> index f86262b11416..2650c9fa125b 100644
>>> --- a/drivers/iommu/Kconfig
>>> +++ b/drivers/iommu/Kconfig
>>> @@ -145,6 +145,20 @@ config IOMMU_DEFAULT_PASSTHROUGH
>>>
>>> endchoice
>>>
>>> +config IOMMU_ENTRY_SYNC
>>> +    bool
>>> +    default n
>>> +
>>> +config IOMMU_ENTRY_SYNC64
>>> +    bool
>>> +    select IOMMU_ENTRY_SYNC
>>> +    default n
>>> +
>>> +config IOMMU_ENTRY_SYNC128
>>> +    bool
>>> +    select IOMMU_ENTRY_SYNC
>>> +    default n
>>> +
>>> config OF_IOMMU
>>>     def_bool y
>>>     depends on OF && IOMMU_API
>>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>>> index 0275821f4ef9..bd923995497a 100644
>>> --- a/drivers/iommu/Makefile
>>> +++ b/drivers/iommu/Makefile
>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>>> obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>>> obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>>> obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>>> +obj-$(CONFIG_IOMMU_ENTRY_SYNC) += entry_sync.o
>>> obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>>> obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>>> obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>>> diff --git a/drivers/iommu/entry_sync.h b/drivers/iommu/entry_sync.h
>>> new file mode 100644
>>> index 000000000000..004d421c71c0
>>> --- /dev/null
>>> +++ b/drivers/iommu/entry_sync.h
>>> @@ -0,0 +1,66 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Many IOMMU implementations store data structures in host memory that can be
>>> + * quite big. The iommu is able to DMA read the host memory using an atomic
>>> + * quanta, usually 64 or 128 bits, and will read an entry using multiple quanta
>>> + * reads.
>>> + *
>>> + * Updating the host memory datastructure entry while the HW is concurrently
>>> + * DMA'ing it is a little bit involved, but if you want to do this hitlessly,
>>> + * while never making the entry non-valid, then it becomes quite complicated.
>>> + *
>>> + * entry_sync is a library to handle this task. It works on the notion of "used
>>> + * bits" which reflect which bits the HW is actually sensitive to and which bits
>>> + * are ignored by hardware. Many hardware specifications say things like 'if
>>> + * mode is X then bits ABC are ignored'.
>>> + *
>>> + * Using the ignored bits entry_sync can often compute a series of ordered
>>> + * writes and flushes that will allow the entry to be updated while keeping it
>>> + * valid. If such an update is not possible then entry will be made temporarily
>>> + * non-valid.
>>> + *
>>> + * A 64 and 128 bit quanta version is provided to support existing iommus.
>>> + */
>>> +#ifndef IOMMU_ENTRY_SYNC_H
>>> +#define IOMMU_ENTRY_SYNC_H
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/compiler.h>
>>> +#include <linux/bug.h>
>>> +
>>> +/* Caller allocates a stack array of this length to call entry_sync_write() */
>>> +#define ENTRY_SYNC_MEMORY_LEN(writer) ((writer)->num_quantas * 3)
>>> +
>>> +struct entry_sync_writer_ops64;
>>> +struct entry_sync_writer64 {
>>> +    const struct entry_sync_writer_ops64 *ops;
>>> +    size_t num_quantas;
>>> +    size_t vbit_quanta;
>>> +};
>>> +
>>> +struct entry_sync_writer_ops64 {
>>> +    void (*get_used)(const __le64 *entry, __le64 *used);
>>> +    void (*sync)(struct entry_sync_writer64 *writer);
>>> +};
>>> +
>>> +void entry_sync_write64(struct entry_sync_writer64 *writer, __le64 *entry,
>>> +            const __le64 *target, __le64 *memory,
>>> +            size_t memory_len);
>>> +
>>> +struct entry_sync_writer_ops128;
>>> +struct entry_sync_writer128 {
>>> +    const struct entry_sync_writer_ops128 *ops;
>>> +    size_t num_quantas;
>>> +    size_t vbit_quanta;
>>> +};
>>> +
>>> +struct entry_sync_writer_ops128 {
>>> +    void (*get_used)(const u128 *entry, u128 *used);
>>> +    void (*sync)(struct entry_sync_writer128 *writer);
>>> +};
>>> +
>>> +void entry_sync_write128(struct entry_sync_writer128 *writer, u128 *entry,
>>> +             const u128 *target, u128 *memory,
>>> +             size_t memory_len);
>>> +
>>> +#endif
>>> diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
>>> new file mode 100644
>>> index 000000000000..646f518b098e
>>> --- /dev/null
>>> +++ b/drivers/iommu/entry_sync_template.h
>>> @@ -0,0 +1,143 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +#include "entry_sync.h"
>>> +#include <linux/args.h>
>>> +#include <linux/bitops.h>
>>> +
>>> +#ifndef entry_sync_writer
>>> +#define entry_sync_writer entry_sync_writer64
>>> +#define quanta_t __le64
>>> +#define NS(name) CONCATENATE(name, 64)
>>> +#endif
>>> +
>>> +/*
>>> + * Figure out if we can do a hitless update of entry to become target. Returns a
>>> + * bit mask where 1 indicates that a quanta word needs to be set disruptively.
>>> + * unused_update is an intermediate value of entry that has unused bits set to
>>> + * their new values.
>>> + */
>>> +static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
>>> +                const quanta_t *entry, const quanta_t *target,
>>> +                quanta_t *unused_update, quanta_t *memory)
>>> +{
>>> +    quanta_t *target_used = memory + writer->num_quantas * 1;
>>> +    quanta_t *cur_used = memory + writer->num_quantas * 2;
>>> +    u8 used_qword_diff = 0;
>>> +    unsigned int i;
>>> +
>>> +    writer->ops->get_used(entry, cur_used);
>>> +    writer->ops->get_used(target, target_used);
>>> +
>>> +    for (i = 0; i != writer->num_quantas; i++) {
>>> +        /*
>>> +         * Check that masks are up to date, the make functions are not
>>
>> nit: "the make functions" looks like a typo.

That seems to be a typo. Will clear it in v2.

>>> +         * allowed to set a bit to 1 if the used function doesn't say it
>>> +         * is used.
>>> +         */
>>> +        WARN_ON_ONCE(target[i] & ~target_used[i]);
>>> +
>>> +        /* Bits can change because they are not currently being used */
>>> +        unused_update[i] = (entry[i] & cur_used[i]) |
>>> +                   (target[i] & ~cur_used[i]);
>>> +        /*
>>> +         * Each bit indicates that a used bit in a qword needs to be
>>> +         * changed after unused_update is applied.
>>> +         */
>>> +        if ((unused_update[i] & target_used[i]) != target[i])
>>> +            used_qword_diff |= 1 << i;
>>> +    }
>>> +    return used_qword_diff;
>>> +}
>>> +
>>> +/*
>>> + * Update the entry to the target configuration. The transition from the current
>>> + * entry to the target entry takes place over multiple steps that attempts to
>>> + * make the transition hitless if possible. This function takes care not to
>>> + * create a situation where the HW can perceive a corrupted entry. HW is only
>>> + * required to have a quanta-bit atomicity with stores from the CPU, while
>>> + * entries are many quanta bit values big.
>>> + *
>>> + * The difference between the current value and the target value is analyzed to
>>> + * determine which of three updates are required - disruptive, hitless or no
>>> + * change.
>>> + *
>>> + * In the most general disruptive case we can make any update in three steps:
>>> + *  - Disrupting the entry (V=0)
>>> + *  - Fill now unused quanta words, except qword 0 which contains V
>>> + *  - Make qword 0 have the final value and valid (V=1) with a single 64
>>> + *    bit store
>>> + *
>>> + * However this disrupts the HW while it is happening. There are several
>>> + * interesting cases where a STE/CD can be updated without disturbing the HW
>>> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
>>> + * because the used bits don't intersect. We can detect this by calculating how
>>> + * many 64 bit values need update after adjusting the unused bits and skip the
>>> + * V=0 process. This relies on the IGNORED behavior described in the
>>> + * specification.
>>> + */
>>> +void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
>>> +              const quanta_t *target, quanta_t *memory,
>>> +              size_t memory_len)
>>> +{
>>> +    quanta_t *unused_update = memory + writer->num_quantas * 0;
>>> +    u8 used_qword_diff;
>>> +
>>> +    if (WARN_ON(memory_len !=
>>> +            ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
>>> +        return;
>>> +
>>> +    used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
>>> +                        unused_update, memory);
>>> +    if (hweight8(used_qword_diff) == 1) {
>>> +        /*
>>> +         * Only one quanta needs its used bits to be changed. This is a
>>> +         * hitless update, update all bits the current entry is ignoring
>>> +         * to their new values, then update a single "critical quanta"
>>> +         * to change the entry and finally 0 out any bits that are now
>>> +         * unused in the target configuration.
>>> +         */
>>> +        unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
>>> +
>>> +        /*
>>> +         * Skip writing unused bits in the critical quanta since we'll
>>> +         * be writing it in the next step anyways. This can save a sync
>>> +         * when the only change is in that quanta.
>>> +         */
>>> +        unused_update[critical_qword_index] =
>>> +            entry[critical_qword_index];
>>> +        NS(entry_set)(writer, entry, unused_update, 0,
>>> +                  writer->num_quantas);
>>> +        NS(entry_set)(writer, entry, target, critical_qword_index, 1);
>>> +        NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
>>> +    } else if (used_qword_diff) {
>>> +        /*
>>> +         * At least two quantas need their inuse bits to be changed.
>>> +         * This requires a breaking update, zero the V bit, write all
>>> +         * qwords but 0, then set qword 0
>>> +         */
>>> +        unused_update[writer->vbit_quanta] = 0;
>>> +        NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
>>> +
>>> +        if (writer->vbit_quanta != 0)
>>> +            NS(entry_set)(writer, entry, target, 0,
>>> +                      writer->vbit_quanta - 1);
>>
>> Looking at the definition of the entry_set below, the last argument is
>> length. So if vbit_quanta is 1 then it would write zero len. Shouldn't it
>> be writing quantas before the vbit_quanta?
>>> +        if (writer->vbit_quanta != writer->num_quantas)
> 
> Looking at this again, I think vbit_quanta can never be equal to
> num_quanta as num_quantas is length and vbit_quanta is index?
>>> +            NS(entry_set)(writer, entry, target,
>>> +                      writer->vbit_quanta,
> 
> Starting from vbit_quanta will set the present bit if it is set in the
> target?
>>> +                      writer->num_quantas - 1);
>>
>> Sami here, the last argument should not have "- 1".
> 
> I meant "Same here".

This branch is the disruptive update path. The process is:

1. Clear the Valid bit. The hardware now ignores this entry.
2. Write all the new data for the words before the Valid bit.
3. Write all the new data for the words after the Valid bit.
4. Write the word containing the Valid bit. The entry is now live again
    with all the new data.

Yes. The last argument for entry_set is length, not index. So perhaps I
could update it like this?

diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
index 646f518b098e..423cbb874919 100644
--- a/drivers/iommu/entry_sync_template.h
+++ b/drivers/iommu/entry_sync_template.h
@@ -118,12 +118,11 @@ void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
                 NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
 
                 if (writer->vbit_quanta != 0)
-                       NS(entry_set)(writer, entry, target, 0,
-                                     writer->vbit_quanta - 1);
-               if (writer->vbit_quanta != writer->num_quantas)
+                       NS(entry_set)(writer, entry, target, 0, writer->vbit_quanta);
+               if (writer->vbit_quanta + 1 < writer->num_quantas)
                         NS(entry_set)(writer, entry, target,
-                                     writer->vbit_quanta,
-                                     writer->num_quantas - 1);
+                                     writer->vbit_quanta + 1,
+                                     writer->num_quantas - writer->vbit_quanta - 1);
 
                 NS(entry_set)(writer, entry, target, writer->vbit_quanta, 1);
         } else {
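
As a sanity check of that start/length arithmetic, here is a toy
userspace model of the disruptive path. It is a sketch only: a plain
memcpy stands in for the real quanta stores and syncs, and NUM_QUANTAS
and VBIT_QUANTA are made-up values for illustration.

```c
#include <assert.h>
#include <string.h>

/* Toy 4-quanta entry with the Valid bit in quanta 1, so both the
 * "before V" and "after V" ranges in the disruptive path are
 * non-empty and the index/length math gets exercised. */
#define NUM_QUANTAS 4
#define VBIT_QUANTA 1

typedef unsigned long long quanta_t;

/* Stand-in for NS(entry_set)(): copy [start, start + len) from val. */
static void entry_set(quanta_t *entry, const quanta_t *val,
		      unsigned int start, unsigned int len)
{
	memcpy(entry + start, val + start, len * sizeof(*entry));
}

static void disruptive_update(quanta_t *entry, const quanta_t *target)
{
	quanta_t zero_v[NUM_QUANTAS] = {0};

	/* 1. Clear the Valid quanta; HW now ignores the entry. */
	entry_set(entry, zero_v, VBIT_QUANTA, 1);
	/* 2. New data for the quantas before the Valid quanta. */
	if (VBIT_QUANTA != 0)
		entry_set(entry, target, 0, VBIT_QUANTA);
	/* 3. New data for the quantas after the Valid quanta. */
	if (VBIT_QUANTA + 1 < NUM_QUANTAS)
		entry_set(entry, target, VBIT_QUANTA + 1,
			  NUM_QUANTAS - VBIT_QUANTA - 1);
	/* 4. Finally the Valid quanta itself; the entry is live again. */
	entry_set(entry, target, VBIT_QUANTA, 1);
}
```

Every quanta ends up with its target value exactly once, and the Valid
quanta is only written last.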

Thanks,
baolu

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-12 11:44       ` Jason Gunthorpe
@ 2026-03-15  8:11         ` Baolu Lu
  2026-03-23 13:07           ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Baolu Lu @ 2026-03-15  8:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/12/26 19:44, Jason Gunthorpe wrote:
> On Thu, Mar 12, 2026 at 03:50:03PM +0800, Baolu Lu wrote:
>> If I understand your remark correctly, the driver should only need the
>> following in the sync callback:
>>
>> - clflush (if non-coherent) to ensure the entry is in physical memory.
>> - PASID cache invalidation to force the hardware to re-read the entry.
> 
> Yes
> 
>> - Device-TLB invalidation to drop local device caches.
> 
> I have prefered to keep this outside the entry_set system since it has
> nothing to do with updating the context entry.
> 
> There should be only one ATS flush after the new entry is installed.

Okay, I will move the devtlb_invalidation_with_pasid() calls outside of
the entry_sync system, right after the call to the writer returns.

> 
>>> ATC invalidations should always be done after the PASID entry is
>>> written. During a hitless update both translations are unpredictably
>>> combined, this is unavoidable and OK.
>>
>> The VT-d spec (Sections 6.5.2.5 and 6.5.2.6) explicitly mandates that an
>> IOTLB invalidation must precede the Device-TLB invalidation. If we only
>> do the device-TLB invalidation in the sync callback, we risk the device
>> re-fetching a stale translation from the IOMMU's internal IOTLB.
> 
> It is a little weird that is says that, that is worth checking into.
> 
> The other text is clear that the IOTLB is cached by DID,PASID only, so
> if the new PASID entry has a DID,PASID which is already coherent in
> the IOTLB it should not need any IOTLB flushing.
> 
> ie flushing the PASID table should immediately change any ATC fetches
> from using DID,old_PASID to DID,new_PASID.
> 
> If there is some issue where the PASID flush doesn't fence everything
> (ie an ATC fetch of DID,old_PASID can be passed by an ATC invalidation)
> then you may need IOTLB invalidations not to manage coherence but to
> manage ordering. That is an important detail if true.

On Intel hardware, the PASID-cache and IOTLB are not inclusive. A PASID-
cache invalidation forces a re-fetch of the pasid entry, but it does not
automatically purge downstream IOTLB entries. The spec-mandated IOTLB
flush serves as a synchronization barrier to ensure that in-flight
translation requests are drained and the internal IOMMU state is
consistent before the invalidation request is sent over PCIe to the
device's ATC.

Without this "IOTLB -> Wait Descriptor -> ATC" sequence, there is a risk
that the device re-populates its ATC from a stale entry still residing
in the IOMMU's internal IOTLB, even after the PASID entry itself has
been updated.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-13  5:39   ` Nicolin Chen
@ 2026-03-16  6:24     ` Baolu Lu
  2026-03-23 12:59       ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Baolu Lu @ 2026-03-16  6:24 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, Samiullah Khawaja, iommu,
	linux-kernel

On 3/13/26 13:39, Nicolin Chen wrote:
> Hi Baolu,

Hi Nicolin,

Thanks for the comments.

> 
> On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>> +struct entry_sync_writer_ops64;
>> +struct entry_sync_writer64 {
>> +	const struct entry_sync_writer_ops64 *ops;
>> +	size_t num_quantas;
>> +	size_t vbit_quanta;
>> +};
> 
> Though I could guess what the @num_quantas and @vbit_quanta likely
> mean, it'd be nicer to have some notes elaborating them.

Yes. I will make it like this,

struct entry_sync_writer64 {
         const struct entry_sync_writer_ops64 *ops;
         /* Total size of the entry in atomic units: */
         size_t num_quantas;
         /* The index of the quanta containing the Valid bit: */
         size_t vbit_quanta;
};

The same applies to entry_sync_writer128.

> 
>> +/*
>> + * Figure out if we can do a hitless update of entry to become target. Returns a
>> + * bit mask where 1 indicates that a quanta word needs to be set disruptively.
>> + * unused_update is an intermediate value of entry that has unused bits set to
>> + * their new values.
>> + */
>> +static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
>> +				const quanta_t *entry, const quanta_t *target,
>> +				quanta_t *unused_update, quanta_t *memory)
>> +{
>> +	quanta_t *target_used = memory + writer->num_quantas * 1;
>> +	quanta_t *cur_used = memory + writer->num_quantas * 2;
> 
> Should we have a kdoc somewhere mentioning that the two arrays are
> neighbors (IIUIC)?

The library uses a single block of scratchpad memory and offsets into
it. A WARN_ON() in NS(entry_sync_write) ensures the caller provided the
full block:

         if (WARN_ON(memory_len !=
                     ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
                 return;

How about adding the following comment around this WARN_ON()?

/*
  * The scratchpad memory is organized into three contiguous regions:
  * 1. [0, num_quantas): 'unused_update' - intermediate state with
  *    ignored bits updated.
  * 2. [num_quantas, 2*num_quantas): 'target_used' - bits used by
  *    the target state.
  * 3. [2*num_quantas, 3*num_quantas): 'cur_used' - bits used by
  *    the current state.
  */

>> +	u8 used_qword_diff = 0;
> 
> It seems to me that we want use "quanta" v.s. "qword"? 128 bits can
> be called "dqword" as well though.

Yes. "qword" is a bit too x86-centric. Since the library is designed
around the concept of an atomic "quanta" of update, I will unify the
terminology on "quanta" and rename this to used_quanta_diff.

> 
>> +	unsigned int i;
>> +
>> +	writer->ops->get_used(entry, cur_used);
>> +	writer->ops->get_used(target, target_used);
> 
> SMMU has get_update_safe now. Can we take it together?

I will look into the SMMUv3 get_update_safe implementation, or
integrate it when we transition the ARM SMMUv3 driver to this generic
entry_sync library.

> 
>> +void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
>> +			  const quanta_t *target, quanta_t *memory,
>> +			  size_t memory_len)
>> +{
>> +	quanta_t *unused_update = memory + writer->num_quantas * 0;
>> +	u8 used_qword_diff;
>> +
>> +	if (WARN_ON(memory_len !=
>> +		    ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
>> +		return;
>> +
>> +	used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
>> +						unused_update, memory);
>> +	if (hweight8(used_qword_diff) == 1) {
>> +		/*
>> +		 * Only one quanta needs its used bits to be changed. This is a
>> +		 * hitless update, update all bits the current entry is ignoring
>> +		 * to their new values, then update a single "critical quanta"
>> +		 * to change the entry and finally 0 out any bits that are now
>> +		 * unused in the target configuration.
>> +		 */
>> +		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
>> +
>> +		/*
>> +		 * Skip writing unused bits in the critical quanta since we'll
>> +		 * be writing it in the next step anyways. This can save a sync
>> +		 * when the only change is in that quanta.
>> +		 */
>> +		unused_update[critical_qword_index] =
>> +			entry[critical_qword_index];
>> +		NS(entry_set)(writer, entry, unused_update, 0,
>> +			      writer->num_quantas);
>> +		NS(entry_set)(writer, entry, target, critical_qword_index, 1);
>> +		NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
>> +	} else if (used_qword_diff) {
>> +		/*
>> +		 * At least two quantas need their inuse bits to be changed.
>> +		 * This requires a breaking update, zero the V bit, write all
>> +		 * qwords but 0, then set qword 0
>> +		 */
> 
> Still, it'd be nicer to unify the wording between "quanta" and
> "qword".

Yes.

> 
> [..]
>> +EXPORT_SYMBOL(NS(entry_sync_write));
> 
> There is also a KUNIT test coverage in arm-smmu-v3 for all of these
> functions. Maybe we can make that generic as well?

Same here.

> 
>> +#define entry_sync_writer entry_sync_writer64
>> +#define quanta_t __le64
> [..]
>> +#define entry_sync_writer entry_sync_writer128
>> +#define quanta_t u128
> 
> u64 can be called 64 too, though we might not have use case for now.
> 
> But maybe we could just call them:
>      entry_sync_writer_le64
>      entry_sync_writer_u128
> ?

I'm fine with the new naming. It is more explicit. I will update the
names unless there are further objections.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-14  8:13       ` Baolu Lu
@ 2026-03-16  9:51         ` Will Deacon
  2026-03-18  3:10           ` Baolu Lu
  2026-03-16 16:35         ` Samiullah Khawaja
  1 sibling, 1 reply; 34+ messages in thread
From: Will Deacon @ 2026-03-16  9:51 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Samiullah Khawaja, Joerg Roedel, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On Sat, Mar 14, 2026 at 04:13:27PM +0800, Baolu Lu wrote:
> On 3/10/26 08:06, Samiullah Khawaja wrote:
> > On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
> > > On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > 
> > > > Many IOMMU implementations store data structures in host memory that can
> > > > be quite big. The iommu is able to DMA read the host memory using an
> > > > atomic quanta, usually 64 or 128 bits, and will read an entry using
> > > > multiple quanta reads.
> > > > 
> > > > Updating the host memory datastructure entry while the HW is
> > > > concurrently
> > > > DMA'ing it is a little bit involved, but if you want to do this
> > > > hitlessly,
> > > > while never making the entry non-valid, then it becomes quite
> > > > complicated.
> > > > 
> > > > entry_sync is a library to handle this task. It works on the notion of
> > > > "used bits" which reflect which bits the HW is actually sensitive to and
> > > > which bits are ignored by hardware. Many hardware specifications say
> > > > things like 'if mode is X then bits ABC are ignored'.
> > > > 
> > > > Using the ignored bits entry_sync can often compute a series of ordered
> > > > writes and flushes that will allow the entry to be updated while keeping
> > > > it valid. If such an update is not possible then entry will be made
> > > > temporarily non-valid.
> > > > 
> > > > A 64 and 128 bit quanta version is provided to support existing iommus.
> > > > 
> > > > Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > ---
> > > > drivers/iommu/Kconfig               |  14 +++
> > > > drivers/iommu/Makefile              |   1 +
> > > > drivers/iommu/entry_sync.h          |  66 +++++++++++++
> > > > drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
> > > > drivers/iommu/entry_sync.c          |  68 +++++++++++++
> > > > 5 files changed, 292 insertions(+)
> > > > create mode 100644 drivers/iommu/entry_sync.h
> > > > create mode 100644 drivers/iommu/entry_sync_template.h
> > > > create mode 100644 drivers/iommu/entry_sync.c

Shouldn't we move the SMMU driver over to this, rather than copy-pasting
everything? If not, then why is it in generic IOMMU code?

Will

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-14  8:13       ` Baolu Lu
  2026-03-16  9:51         ` Will Deacon
@ 2026-03-16 16:35         ` Samiullah Khawaja
  2026-03-18  3:23           ` Baolu Lu
  1 sibling, 1 reply; 34+ messages in thread
From: Samiullah Khawaja @ 2026-03-16 16:35 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On Sat, Mar 14, 2026 at 04:13:27PM +0800, Baolu Lu wrote:
>On 3/10/26 08:06, Samiullah Khawaja wrote:
>>On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
>>>On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>>>>From: Jason Gunthorpe <jgg@nvidia.com>
>>>>
>>>>Many IOMMU implementations store data structures in host memory that can
>>>>be quite big. The iommu is able to DMA read the host memory using an
>>>>atomic quanta, usually 64 or 128 bits, and will read an entry using
>>>>multiple quanta reads.
>>>>
>>>>Updating the host memory datastructure entry while the HW is 
>>>>concurrently
>>>>DMA'ing it is a little bit involved, but if you want to do this 
>>>>hitlessly,
>>>>while never making the entry non-valid, then it becomes quite 
>>>>complicated.
>>>>
>>>>entry_sync is a library to handle this task. It works on the notion of
>>>>"used bits" which reflect which bits the HW is actually sensitive to and
>>>>which bits are ignored by hardware. Many hardware specifications say
>>>>things like 'if mode is X then bits ABC are ignored'.
>>>>
>>>>Using the ignored bits entry_sync can often compute a series of ordered
>>>>writes and flushes that will allow the entry to be updated while keeping
>>>>it valid. If such an update is not possible then entry will be made
>>>>temporarily non-valid.
>>>>
>>>>A 64 and 128 bit quanta version is provided to support existing iommus.
>>>>
>>>>Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
>>>>Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>>>>Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>---
>>>>drivers/iommu/Kconfig               |  14 +++
>>>>drivers/iommu/Makefile              |   1 +
>>>>drivers/iommu/entry_sync.h          |  66 +++++++++++++
>>>>drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
>>>>drivers/iommu/entry_sync.c          |  68 +++++++++++++
>>>>5 files changed, 292 insertions(+)
>>>>create mode 100644 drivers/iommu/entry_sync.h
>>>>create mode 100644 drivers/iommu/entry_sync_template.h
>>>>create mode 100644 drivers/iommu/entry_sync.c
>>>>
>>>>diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>>>>index f86262b11416..2650c9fa125b 100644
>>>>--- a/drivers/iommu/Kconfig
>>>>+++ b/drivers/iommu/Kconfig
>>>>@@ -145,6 +145,20 @@ config IOMMU_DEFAULT_PASSTHROUGH
>>>>
>>>>endchoice
>>>>
>>>>+config IOMMU_ENTRY_SYNC
>>>>+    bool
>>>>+    default n
>>>>+
>>>>+config IOMMU_ENTRY_SYNC64
>>>>+    bool
>>>>+    select IOMMU_ENTRY_SYNC
>>>>+    default n
>>>>+
>>>>+config IOMMU_ENTRY_SYNC128
>>>>+    bool
>>>>+    select IOMMU_ENTRY_SYNC
>>>>+    default n
>>>>+
>>>>config OF_IOMMU
>>>>    def_bool y
>>>>    depends on OF && IOMMU_API
>>>>diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>>>>index 0275821f4ef9..bd923995497a 100644
>>>>--- a/drivers/iommu/Makefile
>>>>+++ b/drivers/iommu/Makefile
>>>>@@ -10,6 +10,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>>>>obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>>>>obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>>>>obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>>>>+obj-$(CONFIG_IOMMU_ENTRY_SYNC) += entry_sync.o
>>>>obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>>>>obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>>>>obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>>>>diff --git a/drivers/iommu/entry_sync.h b/drivers/iommu/entry_sync.h
>>>>new file mode 100644
>>>>index 000000000000..004d421c71c0
>>>>--- /dev/null
>>>>+++ b/drivers/iommu/entry_sync.h
>>>>@@ -0,0 +1,66 @@
>>>>+/* SPDX-License-Identifier: GPL-2.0-only */
>>>>+/*
>>>>+ * Many IOMMU implementations store data structures in host 
>>>>memory that can be
>>>>+ * quite big. The iommu is able to DMA read the host memory 
>>>>using an atomic
>>>>+ * quanta, usually 64 or 128 bits, and will read an entry using 
>>>>multiple quanta
>>>>+ * reads.
>>>>+ *
>>>>+ * Updating the host memory datastructure entry while the HW is 
>>>>concurrently
>>>>+ * DMA'ing it is a little bit involved, but if you want to do 
>>>>this hitlessly,
>>>>+ * while never making the entry non-valid, then it becomes 
>>>>quite complicated.
>>>>+ *
>>>>+ * entry_sync is a library to handle this task. It works on the 
>>>>notion of "used
>>>>+ * bits" which reflect which bits the HW is actually sensitive 
>>>>to and which bits
>>>>+ * are ignored by hardware. Many hardware specifications say 
>>>>things like 'if
>>>>+ * mode is X then bits ABC are ignored'.
>>>>+ *
>>>>+ * Using the ignored bits entry_sync can often compute a series 
>>>>of ordered
>>>>+ * writes and flushes that will allow the entry to be updated 
>>>>while keeping it
>>>>+ * valid. If such an update is not possible then entry will be 
>>>>made temporarily
>>>>+ * non-valid.
>>>>+ *
>>>>+ * A 64 and 128 bit quanta version is provided to support 
>>>>existing iommus.
>>>>+ */
>>>>+#ifndef IOMMU_ENTRY_SYNC_H
>>>>+#define IOMMU_ENTRY_SYNC_H
>>>>+
>>>>+#include <linux/types.h>
>>>>+#include <linux/compiler.h>
>>>>+#include <linux/bug.h>
>>>>+
>>>>+/* Caller allocates a stack array of this length to call 
>>>>entry_sync_write() */
>>>>+#define ENTRY_SYNC_MEMORY_LEN(writer) ((writer)->num_quantas * 3)
>>>>+
>>>>+struct entry_sync_writer_ops64;
>>>>+struct entry_sync_writer64 {
>>>>+    const struct entry_sync_writer_ops64 *ops;
>>>>+    size_t num_quantas;
>>>>+    size_t vbit_quanta;
>>>>+};
>>>>+
>>>>+struct entry_sync_writer_ops64 {
>>>>+    void (*get_used)(const __le64 *entry, __le64 *used);
>>>>+    void (*sync)(struct entry_sync_writer64 *writer);
>>>>+};
>>>>+
>>>>+void entry_sync_write64(struct entry_sync_writer64 *writer, 
>>>>__le64 *entry,
>>>>+            const __le64 *target, __le64 *memory,
>>>>+            size_t memory_len);
>>>>+
>>>>+struct entry_sync_writer_ops128;
>>>>+struct entry_sync_writer128 {
>>>>+    const struct entry_sync_writer_ops128 *ops;
>>>>+    size_t num_quantas;
>>>>+    size_t vbit_quanta;
>>>>+};
>>>>+
>>>>+struct entry_sync_writer_ops128 {
>>>>+    void (*get_used)(const u128 *entry, u128 *used);
>>>>+    void (*sync)(struct entry_sync_writer128 *writer);
>>>>+};
>>>>+
>>>>+void entry_sync_write128(struct entry_sync_writer128 *writer, 
>>>>u128 *entry,
>>>>+             const u128 *target, u128 *memory,
>>>>+             size_t memory_len);
>>>>+
>>>>+#endif
>>>>diff --git a/drivers/iommu/entry_sync_template.h 
>>>>b/drivers/iommu/ entry_sync_template.h
>>>>new file mode 100644
>>>>index 000000000000..646f518b098e
>>>>--- /dev/null
>>>>+++ b/drivers/iommu/entry_sync_template.h
>>>>@@ -0,0 +1,143 @@
>>>>+/* SPDX-License-Identifier: GPL-2.0-only */
>>>>+#include "entry_sync.h"
>>>>+#include <linux/args.h>
>>>>+#include <linux/bitops.h>
>>>>+
>>>>+#ifndef entry_sync_writer
>>>>+#define entry_sync_writer entry_sync_writer64
>>>>+#define quanta_t __le64
>>>>+#define NS(name) CONCATENATE(name, 64)
>>>>+#endif
>>>>+
>>>>+/*
>>>>+ * Figure out if we can do a hitless update of entry to become 
>>>>target. Returns a
>>>>+ * bit mask where 1 indicates that a quanta word needs to be 
>>>>set disruptively.
>>>>+ * unused_update is an intermediate value of entry that has 
>>>>unused bits set to
>>>>+ * their new values.
>>>>+ */
>>>>+static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
>>>>+                const quanta_t *entry, const quanta_t *target,
>>>>+                quanta_t *unused_update, quanta_t *memory)
>>>>+{
>>>>+    quanta_t *target_used = memory + writer->num_quantas * 1;
>>>>+    quanta_t *cur_used = memory + writer->num_quantas * 2;
>>>>+    u8 used_qword_diff = 0;
>>>>+    unsigned int i;
>>>>+
>>>>+    writer->ops->get_used(entry, cur_used);
>>>>+    writer->ops->get_used(target, target_used);
>>>>+
>>>>+    for (i = 0; i != writer->num_quantas; i++) {
>>>>+        /*
>>>>+         * Check that masks are up to date, the make functions are not
>>>
>>>nit: "the make functions" looks like a typo.
>
>That seems to be a typo. Will clear it in v2.
>
>>>>+         * allowed to set a bit to 1 if the used function 
>>>>doesn't say it
>>>>+         * is used.
>>>>+         */
>>>>+        WARN_ON_ONCE(target[i] & ~target_used[i]);
>>>>+
>>>>+        /* Bits can change because they are not currently being used */
>>>>+        unused_update[i] = (entry[i] & cur_used[i]) |
>>>>+                   (target[i] & ~cur_used[i]);
>>>>+        /*
>>>>+         * Each bit indicates that a used bit in a qword needs to be
>>>>+         * changed after unused_update is applied.
>>>>+         */
>>>>+        if ((unused_update[i] & target_used[i]) != target[i])
>>>>+            used_qword_diff |= 1 << i;
>>>>+    }
>>>>+    return used_qword_diff;
>>>>+}
>>>>+
>>>>+/*
>>>>+ * Update the entry to the target configuration. The transition 
>>>>from the current
>>>>+ * entry to the target entry takes place over multiple steps 
>>>>that attempts to
>>>>+ * make the transition hitless if possible. This function takes 
>>>>care not to
>>>>+ * create a situation where the HW can perceive a corrupted 
>>>>entry. HW is only
>>>>+ * required to have a quanta-bit atomicity with stores from the 
>>>>CPU, while
>>>>+ * entries are many quanta bit values big.
>>>>+ *
>>>>+ * The difference between the current value and the target 
>>>>value is analyzed to
>>>>+ * determine which of three updates are required - disruptive, 
>>>>hitless or no
>>>>+ * change.
>>>>+ *
>>>>+ * In the most general disruptive case we can make any update 
>>>>in three steps:
>>>>+ *  - Disrupting the entry (V=0)
>>>>+ *  - Fill now unused quanta words, except qword 0 which contains V
>>>>+ *  - Make qword 0 have the final value and valid (V=1) with a 
>>>>single 64
>>>>+ *    bit store
>>>>+ *
>>>>+ * However this disrupts the HW while it is happening. There 
>>>>are several
>>>>+ * interesting cases where a STE/CD can be updated without 
>>>>disturbing the HW
>>>>+ * because only a small number of bits are changing (S1DSS, 
>>>>CONFIG, etc) or
>>>>+ * because the used bits don't intersect. We can detect this by 
>>>>calculating how
>>>>+ * many 64 bit values need update after adjusting the unused 
>>>>bits and skip the
>>>>+ * V=0 process. This relies on the IGNORED behavior described in the
>>>>+ * specification.
>>>>+ */
>>>>+void NS(entry_sync_write)(struct entry_sync_writer *writer, 
>>>>quanta_t *entry,
>>>>+              const quanta_t *target, quanta_t *memory,
>>>>+              size_t memory_len)
>>>>+{
>>>>+    quanta_t *unused_update = memory + writer->num_quantas * 0;
>>>>+    u8 used_qword_diff;
>>>>+
>>>>+    if (WARN_ON(memory_len !=
>>>>+            ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
>>>>+        return;
>>>>+
>>>>+    used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
>>>>+                        unused_update, memory);
>>>>+    if (hweight8(used_qword_diff) == 1) {
>>>>+        /*
>>>>+         * Only one quanta needs its used bits to be changed. This is a
>>>>+         * hitless update, update all bits the current entry is 
>>>>ignoring
>>>>+         * to their new values, then update a single "critical quanta"
>>>>+         * to change the entry and finally 0 out any bits that are now
>>>>+         * unused in the target configuration.
>>>>+         */
>>>>+        unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
>>>>+
>>>>+        /*
>>>>+         * Skip writing unused bits in the critical quanta since we'll
>>>>+         * be writing it in the next step anyways. This can save a sync
>>>>+         * when the only change is in that quanta.
>>>>+         */
>>>>+        unused_update[critical_qword_index] =
>>>>+            entry[critical_qword_index];
>>>>+        NS(entry_set)(writer, entry, unused_update, 0,
>>>>+                  writer->num_quantas);
>>>>+        NS(entry_set)(writer, entry, target, critical_qword_index, 1);
>>>>+        NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
>>>>+    } else if (used_qword_diff) {
>>>>+        /*
>>>>+         * At least two quantas need their inuse bits to be changed.
>>>>+         * This requires a breaking update, zero the V bit, write all
>>>>+         * qwords but 0, then set qword 0
>>>>+         */
>>>>+        unused_update[writer->vbit_quanta] = 0;
>>>>+        NS(entry_set)(writer, entry, unused_update, writer- 
>>>>>vbit_quanta, 1);
>>>>+
>>>>+        if (writer->vbit_quanta != 0)
>>>>+            NS(entry_set)(writer, entry, target, 0,
>>>>+                      writer->vbit_quanta - 1);
>>>
>>>Looking at the definition of the entry_set below, the last argument is
>>>length. So if vbit_quanta 1 then it would write zero len. Shouldn't it
>>>be writing quantas before the vbit_quanta?
>>>>+        if (writer->vbit_quanta != writer->num_quantas)
>>
>>Looking at this again, I think vbit_quanta can never be equal to
>>num_quanta as num_quantas is length and vbit_quanta is index?
>>>>+            NS(entry_set)(writer, entry, target,
>>>>+                      writer->vbit_quanta,
>>
>>Staring from vbit_quanta will set the present bit if it is set in the
>>target?
>>>>+                      writer->num_quantas - 1);
>>>
>>>Sami here, the last argument should not have "- 1".
>>
>>I meant "Same here".
>
>This branch is the disruptive update path. The process is:
>
>1. Clear the Valid bit. The hardware now ignores this entry.
>2. Write all the new data for the words before the Valid bit.
>3. Write all the new data for the words after the Valid bit.
>4. Write the word containing the Valid bit. The entry is now live again
>   with all the new data.
>
>Yes. The last argument for entry_set is length, not index. So perhaps I
>could update it like this?
>
>diff --git a/drivers/iommu/entry_sync_template.h 
>b/drivers/iommu/entry_sync_template.h
>index 646f518b098e..423cbb874919 100644
>--- a/drivers/iommu/entry_sync_template.h
>+++ b/drivers/iommu/entry_sync_template.h
>@@ -118,12 +118,11 @@ void NS(entry_sync_write)(struct 
>entry_sync_writer *writer, quanta_t *entry,
>                NS(entry_set)(writer, entry, unused_update, 
>writer->vbit_quanta, 1);
>
>                if (writer->vbit_quanta != 0)
>-                       NS(entry_set)(writer, entry, target, 0,
>-                                     writer->vbit_quanta - 1);
>-               if (writer->vbit_quanta != writer->num_quantas)
>+                       NS(entry_set)(writer, entry, target, 0, 
>writer->vbit_quanta);
>+               if (writer->vbit_quanta + 1 < writer->num_quantas)
>                        NS(entry_set)(writer, entry, target,
>-                                     writer->vbit_quanta,
>-                                     writer->num_quantas - 1);
>+                                     writer->vbit_quanta + 1,
>+                                     writer->num_quantas - 
>writer->vbit_quanta - 1);

This looks good.

nit: I am wondering whether we could change the arguments to entry_set,
by adjusting its loop, to take a start and an end quanta instead? That
way the caller doesn't have to do these bound checks. What do you
think?
>
>
>                NS(entry_set)(writer, entry, target, 
>writer->vbit_quanta, 1);
>        } else {
>
>Thanks,
>baolu

Thanks,
Sami

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-16  9:51         ` Will Deacon
@ 2026-03-18  3:10           ` Baolu Lu
  2026-03-23 12:55             ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Baolu Lu @ 2026-03-18  3:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Samiullah Khawaja, Joerg Roedel, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On 3/16/26 17:51, Will Deacon wrote:
> On Sat, Mar 14, 2026 at 04:13:27PM +0800, Baolu Lu wrote:
>> On 3/10/26 08:06, Samiullah Khawaja wrote:
>>> On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
>>>> On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>>>>> [..]
> Shouldn't we move the SMMU driver over to this, rather than copy-pasting
> everything? If not, then why is it in generic IOMMU code?

Yes. I will start to do this from the next version.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-16 16:35         ` Samiullah Khawaja
@ 2026-03-18  3:23           ` Baolu Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Baolu Lu @ 2026-03-18  3:23 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Jason Gunthorpe, Dmytro Maluka, iommu, linux-kernel

On 3/17/26 00:35, Samiullah Khawaja wrote:
> On Sat, Mar 14, 2026 at 04:13:27PM +0800, Baolu Lu wrote:
>> On 3/10/26 08:06, Samiullah Khawaja wrote:
>>> On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
>>>> On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>>
>>>>> Many IOMMU implementations store data structures in host memory 
>>>>> that can
>>>>> be quite big. The iommu is able to DMA read the host memory using an
>>>>> atomic quanta, usually 64 or 128 bits, and will read an entry using
>>>>> multiple quanta reads.
>>>>>
>>>>> Updating the host memory datastructure entry while the HW is 
>>>>> concurrently
>>>>> DMA'ing it is a little bit involved, but if you want to do this 
>>>>> hitlessly,
>>>>> while never making the entry non-valid, then it becomes quite 
>>>>> complicated.
>>>>>
>>>>> entry_sync is a library to handle this task. It works on the notion of
>>>>> "used bits" which reflect which bits the HW is actually sensitive 
>>>>> to and
>>>>> which bits are ignored by hardware. Many hardware specifications say
>>>>> things like 'if mode is X then bits ABC are ignored'.
>>>>>
>>>>> Using the ignored bits entry_sync can often compute a series of 
>>>>> ordered
>>>>> writes and flushes that will allow the entry to be updated while 
>>>>> keeping
>>>>> it valid. If such an update is not possible then entry will be made
>>>>> temporarily non-valid.
>>>>>
>>>>> A 64 and 128 bit quanta version is provided to support existing 
>>>>> iommus.
>>>>>
>>>>> Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
>>>>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> ---
>>>>> drivers/iommu/Kconfig               |  14 +++
>>>>> drivers/iommu/Makefile              |   1 +
>>>>> drivers/iommu/entry_sync.h          |  66 +++++++++++++
>>>>> drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
>>>>> drivers/iommu/entry_sync.c          |  68 +++++++++++++
>>>>> 5 files changed, 292 insertions(+)
>>>>> create mode 100644 drivers/iommu/entry_sync.h
>>>>> create mode 100644 drivers/iommu/entry_sync_template.h
>>>>> create mode 100644 drivers/iommu/entry_sync.c
>>>>>
>>>>> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>>>>> index f86262b11416..2650c9fa125b 100644
>>>>> --- a/drivers/iommu/Kconfig
>>>>> +++ b/drivers/iommu/Kconfig
>>>>> @@ -145,6 +145,20 @@ config IOMMU_DEFAULT_PASSTHROUGH
>>>>>
>>>>> endchoice
>>>>>
>>>>> +config IOMMU_ENTRY_SYNC
>>>>> +    bool
>>>>> +    default n
>>>>> +
>>>>> +config IOMMU_ENTRY_SYNC64
>>>>> +    bool
>>>>> +    select IOMMU_ENTRY_SYNC
>>>>> +    default n
>>>>> +
>>>>> +config IOMMU_ENTRY_SYNC128
>>>>> +    bool
>>>>> +    select IOMMU_ENTRY_SYNC
>>>>> +    default n
>>>>> +
>>>>> config OF_IOMMU
>>>>>     def_bool y
>>>>>     depends on OF && IOMMU_API
>>>>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>>>>> index 0275821f4ef9..bd923995497a 100644
>>>>> --- a/drivers/iommu/Makefile
>>>>> +++ b/drivers/iommu/Makefile
>>>>> @@ -10,6 +10,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>>>>> obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>>>>> obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>>>>> obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>>>>> +obj-$(CONFIG_IOMMU_ENTRY_SYNC) += entry_sync.o
>>>>> obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>>>>> obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>>>>> obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>>>>> diff --git a/drivers/iommu/entry_sync.h b/drivers/iommu/entry_sync.h
>>>>> new file mode 100644
>>>>> index 000000000000..004d421c71c0
>>>>> --- /dev/null
>>>>> +++ b/drivers/iommu/entry_sync.h
>>>>> @@ -0,0 +1,66 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>>>> +/*
>>>>> + * Many IOMMU implementations store data structures in host memory 
>>>>> that can be
>>>>> + * quite big. The iommu is able to DMA read the host memory using 
>>>>> an atomic
>>>>> + * quanta, usually 64 or 128 bits, and will read an entry using 
>>>>> multiple quanta
>>>>> + * reads.
>>>>> + *
>>>>> + * Updating the host memory datastructure entry while the HW is 
>>>>> concurrently
>>>>> + * DMA'ing it is a little bit involved, but if you want to do this 
>>>>> hitlessly,
>>>>> + * while never making the entry non-valid, then it becomes quite 
>>>>> complicated.
>>>>> + *
>>>>> + * entry_sync is a library to handle this task. It works on the 
>>>>> notion of "used
>>>>> + * bits" which reflect which bits the HW is actually sensitive to 
>>>>> and which bits
>>>>> + * are ignored by hardware. Many hardware specifications say 
>>>>> things like 'if
>>>>> + * mode is X then bits ABC are ignored'.
>>>>> + *
>>>>> + * Using the ignored bits entry_sync can often compute a series of 
>>>>> ordered
>>>>> + * writes and flushes that will allow the entry to be updated 
>>>>> while keeping it
>>>>> + * valid. If such an update is not possible then entry will be 
>>>>> made temporarily
>>>>> + * non-valid.
>>>>> + *
>>>>> + * A 64 and 128 bit quanta version is provided to support existing 
>>>>> iommus.
>>>>> + */
>>>>> +#ifndef IOMMU_ENTRY_SYNC_H
>>>>> +#define IOMMU_ENTRY_SYNC_H
>>>>> +
>>>>> +#include <linux/types.h>
>>>>> +#include <linux/compiler.h>
>>>>> +#include <linux/bug.h>
>>>>> +
>>>>> +/* Caller allocates a stack array of this length to call entry_sync_write() */
>>>>> +#define ENTRY_SYNC_MEMORY_LEN(writer) ((writer)->num_quantas * 3)
>>>>> +
>>>>> +struct entry_sync_writer_ops64;
>>>>> +struct entry_sync_writer64 {
>>>>> +    const struct entry_sync_writer_ops64 *ops;
>>>>> +    size_t num_quantas;
>>>>> +    size_t vbit_quanta;
>>>>> +};
>>>>> +
>>>>> +struct entry_sync_writer_ops64 {
>>>>> +    void (*get_used)(const __le64 *entry, __le64 *used);
>>>>> +    void (*sync)(struct entry_sync_writer64 *writer);
>>>>> +};
>>>>> +
>>>>> +void entry_sync_write64(struct entry_sync_writer64 *writer, __le64 *entry,
>>>>> +            const __le64 *target, __le64 *memory,
>>>>> +            size_t memory_len);
>>>>> +
>>>>> +struct entry_sync_writer_ops128;
>>>>> +struct entry_sync_writer128 {
>>>>> +    const struct entry_sync_writer_ops128 *ops;
>>>>> +    size_t num_quantas;
>>>>> +    size_t vbit_quanta;
>>>>> +};
>>>>> +
>>>>> +struct entry_sync_writer_ops128 {
>>>>> +    void (*get_used)(const u128 *entry, u128 *used);
>>>>> +    void (*sync)(struct entry_sync_writer128 *writer);
>>>>> +};
>>>>> +
>>>>> +void entry_sync_write128(struct entry_sync_writer128 *writer, u128 *entry,
>>>>> +             const u128 *target, u128 *memory,
>>>>> +             size_t memory_len);
>>>>> +
>>>>> +#endif
>>>>> diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
>>>>> new file mode 100644
>>>>> index 000000000000..646f518b098e
>>>>> --- /dev/null
>>>>> +++ b/drivers/iommu/entry_sync_template.h
>>>>> @@ -0,0 +1,143 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>>>> +#include "entry_sync.h"
>>>>> +#include <linux/args.h>
>>>>> +#include <linux/bitops.h>
>>>>> +
>>>>> +#ifndef entry_sync_writer
>>>>> +#define entry_sync_writer entry_sync_writer64
>>>>> +#define quanta_t __le64
>>>>> +#define NS(name) CONCATENATE(name, 64)
>>>>> +#endif
>>>>> +
>>>>> +/*
>>>>> + * Figure out if we can do a hitless update of entry to become target.
>>>>> + * Returns a bit mask where 1 indicates that a quanta word needs to be
>>>>> + * set disruptively. unused_update is an intermediate value of entry
>>>>> + * that has unused bits set to their new values.
>>>>> + */
>>>>> +static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
>>>>> +                const quanta_t *entry, const quanta_t *target,
>>>>> +                quanta_t *unused_update, quanta_t *memory)
>>>>> +{
>>>>> +    quanta_t *target_used = memory + writer->num_quantas * 1;
>>>>> +    quanta_t *cur_used = memory + writer->num_quantas * 2;
>>>>> +    u8 used_qword_diff = 0;
>>>>> +    unsigned int i;
>>>>> +
>>>>> +    writer->ops->get_used(entry, cur_used);
>>>>> +    writer->ops->get_used(target, target_used);
>>>>> +
>>>>> +    for (i = 0; i != writer->num_quantas; i++) {
>>>>> +        /*
>>>>> +         * Check that masks are up to date, the make functions are not
>>>>
>>>> nit: "the make functions" looks like a typo.
>>
>> That seems to be a typo. Will fix it in v2.
>>
>>>>> +         * allowed to set a bit to 1 if the used function doesn't say it
>>>>> +         * is used.
>>>>> +         */
>>>>> +        WARN_ON_ONCE(target[i] & ~target_used[i]);
>>>>> +
>>>>> +        /* Bits can change because they are not currently being used */
>>>>> +        unused_update[i] = (entry[i] & cur_used[i]) |
>>>>> +                   (target[i] & ~cur_used[i]);
>>>>> +        /*
>>>>> +         * Each bit indicates that a used bit in a qword needs to be
>>>>> +         * changed after unused_update is applied.
>>>>> +         */
>>>>> +        if ((unused_update[i] & target_used[i]) != target[i])
>>>>> +            used_qword_diff |= 1 << i;
>>>>> +    }
>>>>> +    return used_qword_diff;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Update the entry to the target configuration. The transition from
>>>>> + * the current entry to the target entry takes place over multiple
>>>>> + * steps that attempt to make the transition hitless if possible. This
>>>>> + * function takes care not to create a situation where the HW can
>>>>> + * perceive a corrupted entry. HW is only required to have quanta-bit
>>>>> + * atomicity with stores from the CPU, while entries are many quanta
>>>>> + * bits big.
>>>>> + *
>>>>> + * The difference between the current value and the target value is
>>>>> + * analyzed to determine which of three updates are required -
>>>>> + * disruptive, hitless or no change.
>>>>> + *
>>>>> + * In the most general disruptive case we can make any update in three
>>>>> + * steps:
>>>>> + *  - Disrupt the entry (V=0)
>>>>> + *  - Fill now unused quanta words, except qword 0 which contains V
>>>>> + *  - Make qword 0 have the final value and valid (V=1) with a single
>>>>> + *    64-bit store
>>>>> + *
>>>>> + * However this disrupts the HW while it is happening. There are
>>>>> + * several interesting cases where a STE/CD can be updated without
>>>>> + * disturbing the HW because only a small number of bits are changing
>>>>> + * (S1DSS, CONFIG, etc) or because the used bits don't intersect. We can
>>>>> + * detect this by calculating how many 64-bit values need updating after
>>>>> + * adjusting the unused bits, and skip the V=0 process. This relies on
>>>>> + * the IGNORED behavior described in the specification.
>>>>> + */
>>>>> +void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
>>>>> +              const quanta_t *target, quanta_t *memory,
>>>>> +              size_t memory_len)
>>>>> +{
>>>>> +    quanta_t *unused_update = memory + writer->num_quantas * 0;
>>>>> +    u8 used_qword_diff;
>>>>> +
>>>>> +    if (WARN_ON(memory_len !=
>>>>> +            ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
>>>>> +        return;
>>>>> +
>>>>> +    used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
>>>>> +                        unused_update, memory);
>>>>> +    if (hweight8(used_qword_diff) == 1) {
>>>>> +        /*
>>>>> +         * Only one quanta needs its used bits to be changed. This is a
>>>>> +         * hitless update: update all bits the current entry is ignoring
>>>>> +         * to their new values, then update a single "critical quanta"
>>>>> +         * to change the entry, and finally 0 out any bits that are now
>>>>> +         * unused in the target configuration.
>>>>> +         */
>>>>> +        unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
>>>>> +
>>>>> +        /*
>>>>> +         * Skip writing unused bits in the critical quanta since we'll
>>>>> +         * be writing it in the next step anyway. This can save a sync
>>>>> +         * when the only change is in that quanta.
>>>>> +         */
>>>>> +        unused_update[critical_qword_index] =
>>>>> +            entry[critical_qword_index];
>>>>> +        NS(entry_set)(writer, entry, unused_update, 0,
>>>>> +                  writer->num_quantas);
>>>>> +        NS(entry_set)(writer, entry, target, critical_qword_index, 1);
>>>>> +        NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
>>>>> +    } else if (used_qword_diff) {
>>>>> +        /*
>>>>> +         * At least two quantas need their in-use bits to be changed.
>>>>> +         * This requires a breaking update, zero the V bit, write all
>>>>> +         * qwords but 0, then set qword 0
>>>>> +         */
>>>>> +        unused_update[writer->vbit_quanta] = 0;
>>>>> +        NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
>>>>> +
>>>>> +        if (writer->vbit_quanta != 0)
>>>>> +            NS(entry_set)(writer, entry, target, 0,
>>>>> +                      writer->vbit_quanta - 1);
>>>>
>>>> Looking at the definition of entry_set below, the last argument is a
>>>> length. So if vbit_quanta is 1 then it would write zero length.
>>>> Shouldn't it be writing the quantas before the vbit_quanta?
>>>>> +        if (writer->vbit_quanta != writer->num_quantas)
>>>
>>> Looking at this again, I think vbit_quanta can never be equal to
>>> num_quantas, as num_quantas is a length and vbit_quanta is an index?
>>>>> +            NS(entry_set)(writer, entry, target,
>>>>> +                      writer->vbit_quanta,
>>>
>>> Starting from vbit_quanta will set the present bit if it is set in the
>>> target?
>>>>> +                      writer->num_quantas - 1);
>>>>
>>>> Sami here, the last argument should not have "- 1".
>>>
>>> I meant "Same here".
>>
>> This branch is the disruptive update path. The process is:
>>
>> 1. Clear the Valid bit. The hardware now ignores this entry.
>> 2. Write all the new data for the words before the Valid bit.
>> 3. Write all the new data for the words after the Valid bit.
>> 4. Write the word containing the Valid bit. The entry is now live again
>>   with all the new data.
>>
>> Yes. The last argument for entry_set is length, not index. So perhaps I
>> could update it like this?
>>
>> diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
>> index 646f518b098e..423cbb874919 100644
>> --- a/drivers/iommu/entry_sync_template.h
>> +++ b/drivers/iommu/entry_sync_template.h
>> @@ -118,12 +118,11 @@ void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
>>                NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
>>
>>                if (writer->vbit_quanta != 0)
>> -                       NS(entry_set)(writer, entry, target, 0,
>> -                                     writer->vbit_quanta - 1);
>> -               if (writer->vbit_quanta != writer->num_quantas)
>> +                       NS(entry_set)(writer, entry, target, 0, writer->vbit_quanta);
>> +               if (writer->vbit_quanta + 1 < writer->num_quantas)
>>                        NS(entry_set)(writer, entry, target,
>> -                                     writer->vbit_quanta,
>> -                                     writer->num_quantas - 1);
>> +                                     writer->vbit_quanta + 1,
>> +                                     writer->num_quantas - writer->vbit_quanta - 1);
> 
> This looks good.
> 
> nit: I am wondering whether we can change the arguments to the function,
> by modifying the loop in entry_set, to be start and end quanta instead?
> That way the caller doesn't have to do these bound checks? What do you
> think?

I have no strong opinion about this. Linux kernel memory set and copy
functions generally follow the (pointer, offset, length) pattern, so
moving away from it might cause confusion for other developers. Anyway,
if others also prefer that way, I am fine with adjusting it.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-18  3:10           ` Baolu Lu
@ 2026-03-23 12:55             ` Jason Gunthorpe
  2026-03-24  5:30               ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-23 12:55 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Will Deacon, Samiullah Khawaja, Joerg Roedel, Robin Murphy,
	Kevin Tian, Dmytro Maluka, iommu, linux-kernel

On Wed, Mar 18, 2026 at 11:10:12AM +0800, Baolu Lu wrote:
> > Shouldn't we move the SMMU driver over to this, rather than copy-pasting
> > everything? If not, then why is it in generic IOMMU code?
> 
> Yes. I will start to do this from the next version.

I had written a draft already:

https://github.com/jgunthorpe/linux/commit/cda5c27a4020d162948259df9d3c8dd61196290a

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-16  6:24     ` Baolu Lu
@ 2026-03-23 12:59       ` Jason Gunthorpe
  2026-03-24  5:49         ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-23 12:59 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Mon, Mar 16, 2026 at 02:24:57PM +0800, Baolu Lu wrote:

> > > +	writer->ops->get_used(entry, cur_used);
> > > +	writer->ops->get_used(target, target_used);
> > 
> > SMMU has get_update_safe now. Can we take it together?
> 
> I will look into the SMMUv3 get_update_safe implementation, or integrate
> that separately when we transition the ARM SMMUv3 driver to use this
> generic entry_sync library.

The intention was to copy the existing ARM code as-is. The draft I
sent predates these changes from Nicolin, so it should get updated.

> > > +EXPORT_SYMBOL(NS(entry_sync_write));
> > 
> > There is also a KUNIT test coverage in arm-smmu-v3 for all of these
> > functions. Maybe we can make that generic as well?
> 
> Same here.

That will be a bit hard since it depends on driver functions.

> > But maybe we could just call them:
> >      entry_sync_writer_le64
> >      entry_sync_writer_u128
> > ?
> I'm fine with the new naming. It is more explicit. I will update the
> names unless there are further objections.

I was wondering if we should just be using void * here as the type
safety seems a bit harmful if the goal is to make the 128 bit option
fall back to 64 bits if not supported.

The maximum supported HW atomic quanta can be passed in through the
struct.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-15  8:11         ` Baolu Lu
@ 2026-03-23 13:07           ` Jason Gunthorpe
  2026-03-24  6:22             ` Baolu Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-23 13:07 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Sun, Mar 15, 2026 at 04:11:36PM +0800, Baolu Lu wrote:
> > > > ATC invalidations should always be done after the PASID entry is
> > > > written. During a hitless update both translations are unpredictably
> > > > combined, this is unavoidable and OK.
> > > 
> > > The VT-d spec (Sections 6.5.2.5 and 6.5.2.6) explicitly mandates that an
> > > IOTLB invalidation must precede the Device-TLB invalidation. If we only
> > > do the device-TLB invalidation in the sync callback, we risk the device
> > > re-fetching a stale translation from the IOMMU's internal IOTLB.
> > 
> > It is a little weird that it says that; that is worth checking into.
> > 
> > The other text is clear that the IOTLB is cached by DID,PASID only, so
> > if the new PASID entry has a DID,PASID which is already coherent in
> > the IOTLB it should not need any IOTLB flushing.
> > 
> > ie flushing the PASID table should immediately change any ATC fetches
> > from using DID,old_PASID to DID,new_PASID.
> > 
> > If there is some issue where the PASID flush doesn't fence everything
> > (ie an ATC fetch of DID,old_PASID can be passed by an ATC invalidation)
> > then you may need IOTLB invalidations not to manage coherence but to
> > manage ordering. That is an important detail if true.
> 
> On Intel hardware, the PASID-cache and IOTLB are not inclusive. A PASID-
> cache invalidation forces a re-fetch of the pasid entry, but it does not
> automatically purge downstream IOTLB entries.

It doesn't matter, the updated PASID entry will point to a new DID and
the IOTLB (new DID,PASID) entry will be valid in the IOTLB.

We don't need to flush the IOTLB, we just need to ensure that all
lookups done with (old DID,PASID) are completed before sending any
invalidation.

> The spec-mandated IOTLB flush serves as a synchronization barrier to
> ensure that in-flight translation requests are drained and the
> internal IOMMU state is consistent before the invalidation request
> is sent over PCIe to the device's ATC.

A fencing requirement does make sense, but does it have to be done by
flushing the entire DID,PASID? It is ugly to have to drop the IOTLB
just because a context entry changed.

Can you do a 4k IOVA 0 invalidation and get the same fence?

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-23 12:55             ` Jason Gunthorpe
@ 2026-03-24  5:30               ` Baolu Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Baolu Lu @ 2026-03-24  5:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Samiullah Khawaja, Joerg Roedel, Robin Murphy,
	Kevin Tian, Dmytro Maluka, iommu, linux-kernel

On 3/23/26 20:55, Jason Gunthorpe wrote:
> On Wed, Mar 18, 2026 at 11:10:12AM +0800, Baolu Lu wrote:
>>> Shouldn't we move the SMMU driver over to this, rather than copy-pasting
>>> everything? If not, then why is it in generic IOMMU code?
>> Yes. I will start to do this from the next version.
> I had written a draft already:
> 
> https://github.com/jgunthorpe/linux/commit/cda5c27a4020d162948259df9d3c8dd61196290a

Yeah, I will include this in the next version.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
  2026-03-23 12:59       ` Jason Gunthorpe
@ 2026-03-24  5:49         ` Baolu Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Baolu Lu @ 2026-03-24  5:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/23/26 20:59, Jason Gunthorpe wrote:
> On Mon, Mar 16, 2026 at 02:24:57PM +0800, Baolu Lu wrote:
> 
>>>> +	writer->ops->get_used(entry, cur_used);
>>>> +	writer->ops->get_used(target, target_used);
>>>
>>> SMMU has get_update_safe now. Can we take it together?
>>
>> I will look into the SMMUv3 get_update_safe implementation, or integrate
>> that separately when we transition the ARM SMMUv3 driver to use this
>> generic entry_sync library.
> 
> The intention was to copy the existing ARM code as-is. The draft I
> sent predates these changes from Nicolin, so it should get updated.

Okay.

> 
>>>> +EXPORT_SYMBOL(NS(entry_sync_write));
>>>
>>> There is also a KUNIT test coverage in arm-smmu-v3 for all of these
>>> functions. Maybe we can make that generic as well?
>>
>> Same here.
> 
> That will be a bit hard since it depends on driver functions.
> 
>>> But maybe we could just call them:
>>>       entry_sync_writer_le64
>>>       entry_sync_writer_u128
>>> ?
>> I'm fine with the new naming. It is more explicit. I will update the
>> names unless there are further objections.
> 
> I was wondering if we should just be using void * here as the type
> safety seems a bit harmful if the goal is to make the 128 bit option
> fall back to 64 bits if not supported.
> 
> The maximum supported HW atomic quanta can be passed in through the
> struct.

I will explore refactoring the library to use void * and a dynamic
quanta size for v2.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-23 13:07           ` Jason Gunthorpe
@ 2026-03-24  6:22             ` Baolu Lu
  2026-03-24 12:53               ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Baolu Lu @ 2026-03-24  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On 3/23/26 21:07, Jason Gunthorpe wrote:
> On Sun, Mar 15, 2026 at 04:11:36PM +0800, Baolu Lu wrote:
>>>>> ATC invalidations should always be done after the PASID entry is
>>>>> written. During a hitless update both translations are unpredictably
>>>>> combined, this is unavoidable and OK.
>>>>
>>>> The VT-d spec (Sections 6.5.2.5 and 6.5.2.6) explicitly mandates that an
>>>> IOTLB invalidation must precede the Device-TLB invalidation. If we only
>>>> do the device-TLB invalidation in the sync callback, we risk the device
>>>> re-fetching a stale translation from the IOMMU's internal IOTLB.
>>>
>>> It is a little weird that it says that; that is worth checking into.
>>>
>>> The other text is clear that the IOTLB is cached by DID,PASID only, so
>>> if the new PASID entry has a DID,PASID which is already coherent in
>>> the IOTLB it should not need any IOTLB flushing.
>>>
>>> ie flushing the PASID table should immediately change any ATC fetches
>>> from using DID,old_PASID to DID,new_PASID.
>>>
>>> If there is some issue where the PASID flush doesn't fence everything
>>> (ie an ATC fetch of DID,old_PASID can be passed by an ATC invalidation)
>>> then you may need IOTLB invalidations not to manage coherence but to
>>> manage ordering. That is an important detail if true.
>>
>> On Intel hardware, the PASID-cache and IOTLB are not inclusive. A PASID-
>> cache invalidation forces a re-fetch of the pasid entry, but it does not
>> automatically purge downstream IOTLB entries.
> 
> It doesn't matter, the updated PASID entry will point to a new DID and
> the IOTLB (new DID,PASID) entry will be valid in the IOTLB.
> 
> We don't need to flush the IOTLB, we just need to ensure that all
> lookups done with (old DID,PASID) are completed before sending any
> invalidation.

Yes, you are right.

> 
>> The spec-mandated IOTLB flush serves as a synchronization barrier to
>> ensure that in-flight translation requests are drained and the
>> internal IOMMU state is consistent before the invalidation request
>> is sent over PCIe to the device's ATC.
> 
> A fencing requirement does make sense, but does it have to be done by
> flushing the entire DID,PASID? It is ugly to have to drop the IOTLB
> just because a context entry changed.

I believe the full [old_DID, PASID] invalidation is a functional
necessity rather than just a fencing requirement. Even though the new
PASID entry points to a new_DID, leaving stale translations tagged with
[old_DID, PASID] in the IOTLB is problematic.

However, I agree that IOTLB and Device-TLB invalidation should not be
part of the entry_sync for a PASID entry; instead, it belongs in the
domain replacement logic.

> Can you do a 4k IOVA 0 invalidation and get the same fence?
> 
> Jason

Thanks,
baolu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates
  2026-03-24  6:22             ` Baolu Lu
@ 2026-03-24 12:53               ` Jason Gunthorpe
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2026-03-24 12:53 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
	Dmytro Maluka, Samiullah Khawaja, iommu, linux-kernel

On Tue, Mar 24, 2026 at 02:22:21PM +0800, Baolu Lu wrote:
> > A fencing requirement does make sense, but does it have to be done by
> > flushing the entire DID,PASID? It is ugly to have to drop the IOTLB
> > just because a context entry changed.
> 
> I believe the full [old_DID, PASID] invalidation is a functional
> necessity rather than just a fencing requirement. Even though the new
> PASID entry points to a new_DID, leaving stale translations tagged with
> [old_DID, PASID] in the IOTLB is problematic.

You don't know they are stale, the old_DID,PASID could be used by
another context entry.

The proper time to declare an IOTLB tag as stale is when it is
returned back to the allocator.

The problem in the VT-d design is that it is complicated to manage the
DID lifecycle well - but the driver itself should have a clear idea
when tags are end of life.

Yes, it is a very reasonable simplified design for the driver to say
the DID lifecycle ends at every context entry change, but that is very
different from stating the flush as a mandatory requirement from the
HW spec.

A more complex design could have each context entry request a cache
tag for the context entry, for the specific domain, and share the tags
whenever possible, eg for SVA and S2 domains.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2026-03-24 12:53 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-09  6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
2026-03-09  6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
2026-03-09 23:33   ` Samiullah Khawaja
2026-03-10  0:06     ` Samiullah Khawaja
2026-03-14  8:13       ` Baolu Lu
2026-03-16  9:51         ` Will Deacon
2026-03-18  3:10           ` Baolu Lu
2026-03-23 12:55             ` Jason Gunthorpe
2026-03-24  5:30               ` Baolu Lu
2026-03-16 16:35         ` Samiullah Khawaja
2026-03-18  3:23           ` Baolu Lu
2026-03-13  5:39   ` Nicolin Chen
2026-03-16  6:24     ` Baolu Lu
2026-03-23 12:59       ` Jason Gunthorpe
2026-03-24  5:49         ` Baolu Lu
2026-03-09  6:06 ` [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates Lu Baolu
2026-03-09 13:41   ` Jason Gunthorpe
2026-03-11  8:42     ` Baolu Lu
2026-03-11 12:23       ` Jason Gunthorpe
2026-03-12  7:51         ` Baolu Lu
2026-03-12  7:50     ` Baolu Lu
2026-03-12 11:44       ` Jason Gunthorpe
2026-03-15  8:11         ` Baolu Lu
2026-03-23 13:07           ` Jason Gunthorpe
2026-03-24  6:22             ` Baolu Lu
2026-03-24 12:53               ` Jason Gunthorpe
2026-03-09  6:06 ` [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support Lu Baolu
2026-03-09 13:42   ` Jason Gunthorpe
2026-03-12  7:59     ` Baolu Lu
2026-03-09  6:06 ` [PATCH 4/8] iommu/vt-d: Add trace events for PASID entry sync updates Lu Baolu
2026-03-09  6:06 ` [PATCH 5/8] iommu/vt-d: Use intel_pasid_write() for first-stage setup Lu Baolu
2026-03-09  6:06 ` [PATCH 6/8] iommu/vt-d: Use intel_pasid_write() for second-stage setup Lu Baolu
2026-03-09  6:06 ` [PATCH 7/8] iommu/vt-d: Use intel_pasid_write() for pass-through setup Lu Baolu
2026-03-09  6:06 ` [PATCH 8/8] iommu/vt-d: Use intel_pasid_write() for nested setup Lu Baolu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox