From: Samiullah Khawaja <skhawaja@google.com>
To: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
Robin Murphy <robin.murphy@arm.com>,
Kevin Tian <kevin.tian@intel.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Dmytro Maluka <dmaluka@chromium.org>,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3
Date: Tue, 10 Mar 2026 00:06:54 +0000 [thread overview]
Message-ID: <aa9fQchZJER8QvY0@google.com> (raw)
In-Reply-To: <aa9W4ZKYWHr9-UNy@google.com>
On Mon, Mar 09, 2026 at 11:33:23PM +0000, Samiullah Khawaja wrote:
>On Mon, Mar 09, 2026 at 02:06:41PM +0800, Lu Baolu wrote:
>>From: Jason Gunthorpe <jgg@nvidia.com>
>>
>>Many IOMMU implementations store data structures in host memory that can
>>be quite big. The iommu is able to DMA read the host memory using an
>>atomic quanta, usually 64 or 128 bits, and will read an entry using
>>multiple quanta reads.
>>
>>Updating the host memory datastructure entry while the HW is concurrently
>>DMA'ing it is a little bit involved, but if you want to do this hitlessly,
>>while never making the entry non-valid, then it becomes quite complicated.
>>
>>entry_sync is a library to handle this task. It works on the notion of
>>"used bits" which reflect which bits the HW is actually sensitive to and
>>which bits are ignored by hardware. Many hardware specifications say
>>things like 'if mode is X then bits ABC are ignored'.
>>
>>Using the ignored bits entry_sync can often compute a series of ordered
>>writes and flushes that will allow the entry to be updated while keeping
>>it valid. If such an update is not possible then entry will be made
>>temporarily non-valid.
>>
>>A 64 and 128 bit quanta version is provided to support existing iommus.
>>
>>Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
>>Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>>Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>---
>>drivers/iommu/Kconfig | 14 +++
>>drivers/iommu/Makefile | 1 +
>>drivers/iommu/entry_sync.h | 66 +++++++++++++
>>drivers/iommu/entry_sync_template.h | 143 ++++++++++++++++++++++++++++
>>drivers/iommu/entry_sync.c | 68 +++++++++++++
>>5 files changed, 292 insertions(+)
>>create mode 100644 drivers/iommu/entry_sync.h
>>create mode 100644 drivers/iommu/entry_sync_template.h
>>create mode 100644 drivers/iommu/entry_sync.c
>>
>>diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>>index f86262b11416..2650c9fa125b 100644
>>--- a/drivers/iommu/Kconfig
>>+++ b/drivers/iommu/Kconfig
>>@@ -145,6 +145,20 @@ config IOMMU_DEFAULT_PASSTHROUGH
>>
>>endchoice
>>
>>+config IOMMU_ENTRY_SYNC
>>+ bool
>>+ default n
>>+
>>+config IOMMU_ENTRY_SYNC64
>>+ bool
>>+ select IOMMU_ENTRY_SYNC
>>+ default n
>>+
>>+config IOMMU_ENTRY_SYNC128
>>+ bool
>>+ select IOMMU_ENTRY_SYNC
>>+ default n
>>+
>>config OF_IOMMU
>> def_bool y
>> depends on OF && IOMMU_API
>>diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>>index 0275821f4ef9..bd923995497a 100644
>>--- a/drivers/iommu/Makefile
>>+++ b/drivers/iommu/Makefile
>>@@ -10,6 +10,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>>obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>>obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>>obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
>>+obj-$(CONFIG_IOMMU_ENTRY_SYNC) += entry_sync.o
>>obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>>obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>>obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
>>diff --git a/drivers/iommu/entry_sync.h b/drivers/iommu/entry_sync.h
>>new file mode 100644
>>index 000000000000..004d421c71c0
>>--- /dev/null
>>+++ b/drivers/iommu/entry_sync.h
>>@@ -0,0 +1,66 @@
>>+/* SPDX-License-Identifier: GPL-2.0-only */
>>+/*
>>+ * Many IOMMU implementations store data structures in host memory that can be
>>+ * quite big. The iommu is able to DMA read the host memory using an atomic
>>+ * quanta, usually 64 or 128 bits, and will read an entry using multiple quanta
>>+ * reads.
>>+ *
>>+ * Updating the host memory datastructure entry while the HW is concurrently
>>+ * DMA'ing it is a little bit involved, but if you want to do this hitlessly,
>>+ * while never making the entry non-valid, then it becomes quite complicated.
>>+ *
>>+ * entry_sync is a library to handle this task. It works on the notion of "used
>>+ * bits" which reflect which bits the HW is actually sensitive to and which bits
>>+ * are ignored by hardware. Many hardware specifications say things like 'if
>>+ * mode is X then bits ABC are ignored'.
>>+ *
>>+ * Using the ignored bits entry_sync can often compute a series of ordered
>>+ * writes and flushes that will allow the entry to be updated while keeping it
>>+ * valid. If such an update is not possible then entry will be made temporarily
>>+ * non-valid.
>>+ *
>>+ * A 64 and 128 bit quanta version is provided to support existing iommus.
>>+ */
>>+#ifndef IOMMU_ENTRY_SYNC_H
>>+#define IOMMU_ENTRY_SYNC_H
>>+
>>+#include <linux/types.h>
>>+#include <linux/compiler.h>
>>+#include <linux/bug.h>
>>+
>>+/* Caller allocates a stack array of this length to call entry_sync_write() */
>>+#define ENTRY_SYNC_MEMORY_LEN(writer) ((writer)->num_quantas * 3)
>>+
>>+struct entry_sync_writer_ops64;
>>+struct entry_sync_writer64 {
>>+ const struct entry_sync_writer_ops64 *ops;
>>+ size_t num_quantas;
>>+ size_t vbit_quanta;
>>+};
>>+
>>+struct entry_sync_writer_ops64 {
>>+ void (*get_used)(const __le64 *entry, __le64 *used);
>>+ void (*sync)(struct entry_sync_writer64 *writer);
>>+};
>>+
>>+void entry_sync_write64(struct entry_sync_writer64 *writer, __le64 *entry,
>>+ const __le64 *target, __le64 *memory,
>>+ size_t memory_len);
>>+
>>+struct entry_sync_writer_ops128;
>>+struct entry_sync_writer128 {
>>+ const struct entry_sync_writer_ops128 *ops;
>>+ size_t num_quantas;
>>+ size_t vbit_quanta;
>>+};
>>+
>>+struct entry_sync_writer_ops128 {
>>+ void (*get_used)(const u128 *entry, u128 *used);
>>+ void (*sync)(struct entry_sync_writer128 *writer);
>>+};
>>+
>>+void entry_sync_write128(struct entry_sync_writer128 *writer, u128 *entry,
>>+ const u128 *target, u128 *memory,
>>+ size_t memory_len);
>>+
>>+#endif
>>diff --git a/drivers/iommu/entry_sync_template.h b/drivers/iommu/entry_sync_template.h
>>new file mode 100644
>>index 000000000000..646f518b098e
>>--- /dev/null
>>+++ b/drivers/iommu/entry_sync_template.h
>>@@ -0,0 +1,143 @@
>>+/* SPDX-License-Identifier: GPL-2.0-only */
>>+#include "entry_sync.h"
>>+#include <linux/args.h>
>>+#include <linux/bitops.h>
>>+
>>+#ifndef entry_sync_writer
>>+#define entry_sync_writer entry_sync_writer64
>>+#define quanta_t __le64
>>+#define NS(name) CONCATENATE(name, 64)
>>+#endif
>>+
>>+/*
>>+ * Figure out if we can do a hitless update of entry to become target. Returns a
>>+ * bit mask where 1 indicates that a quanta word needs to be set disruptively.
>>+ * unused_update is an intermediate value of entry that has unused bits set to
>>+ * their new values.
>>+ */
>>+static u8 NS(entry_quanta_diff)(struct entry_sync_writer *writer,
>>+ const quanta_t *entry, const quanta_t *target,
>>+ quanta_t *unused_update, quanta_t *memory)
>>+{
>>+ quanta_t *target_used = memory + writer->num_quantas * 1;
>>+ quanta_t *cur_used = memory + writer->num_quantas * 2;
>>+ u8 used_qword_diff = 0;
>>+ unsigned int i;
>>+
>>+ writer->ops->get_used(entry, cur_used);
>>+ writer->ops->get_used(target, target_used);
>>+
>>+ for (i = 0; i != writer->num_quantas; i++) {
>>+ /*
>>+ * Check that masks are up to date, the make functions are not
>
>nit: "the make functions" looks like a typo.
>>+ * allowed to set a bit to 1 if the used function doesn't say it
>>+ * is used.
>>+ */
>>+ WARN_ON_ONCE(target[i] & ~target_used[i]);
>>+
>>+ /* Bits can change because they are not currently being used */
>>+ unused_update[i] = (entry[i] & cur_used[i]) |
>>+ (target[i] & ~cur_used[i]);
>>+ /*
>>+ * Each bit indicates that a used bit in a qword needs to be
>>+ * changed after unused_update is applied.
>>+ */
>>+ if ((unused_update[i] & target_used[i]) != target[i])
>>+ used_qword_diff |= 1 << i;
>>+ }
>>+ return used_qword_diff;
>>+}
>>+
>>+/*
>>+ * Update the entry to the target configuration. The transition from the current
>>+ * entry to the target entry takes place over multiple steps that attempts to
>>+ * make the transition hitless if possible. This function takes care not to
>>+ * create a situation where the HW can perceive a corrupted entry. HW is only
>>+ * required to have a quanta-bit atomicity with stores from the CPU, while
>>+ * entries are many quanta bit values big.
>>+ *
>>+ * The difference between the current value and the target value is analyzed to
>>+ * determine which of three updates are required - disruptive, hitless or no
>>+ * change.
>>+ *
>>+ * In the most general disruptive case we can make any update in three steps:
>>+ * - Disrupting the entry (V=0)
>>+ * - Fill now unused quanta words, except qword 0 which contains V
>>+ * - Make qword 0 have the final value and valid (V=1) with a single 64
>>+ * bit store
>>+ *
>>+ * However this disrupts the HW while it is happening. There are several
>>+ * interesting cases where a STE/CD can be updated without disturbing the HW
>>+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
>>+ * because the used bits don't intersect. We can detect this by calculating how
>>+ * many 64 bit values need update after adjusting the unused bits and skip the
>>+ * V=0 process. This relies on the IGNORED behavior described in the
>>+ * specification.
>>+ */
>>+void NS(entry_sync_write)(struct entry_sync_writer *writer, quanta_t *entry,
>>+ const quanta_t *target, quanta_t *memory,
>>+ size_t memory_len)
>>+{
>>+ quanta_t *unused_update = memory + writer->num_quantas * 0;
>>+ u8 used_qword_diff;
>>+
>>+ if (WARN_ON(memory_len !=
>>+ ENTRY_SYNC_MEMORY_LEN(writer) * sizeof(*memory)))
>>+ return;
>>+
>>+ used_qword_diff = NS(entry_quanta_diff)(writer, entry, target,
>>+ unused_update, memory);
>>+ if (hweight8(used_qword_diff) == 1) {
>>+ /*
>>+ * Only one quanta needs its used bits to be changed. This is a
>>+ * hitless update, update all bits the current entry is ignoring
>>+ * to their new values, then update a single "critical quanta"
>>+ * to change the entry and finally 0 out any bits that are now
>>+ * unused in the target configuration.
>>+ */
>>+ unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
>>+
>>+ /*
>>+ * Skip writing unused bits in the critical quanta since we'll
>>+ * be writing it in the next step anyways. This can save a sync
>>+ * when the only change is in that quanta.
>>+ */
>>+ unused_update[critical_qword_index] =
>>+ entry[critical_qword_index];
>>+ NS(entry_set)(writer, entry, unused_update, 0,
>>+ writer->num_quantas);
>>+ NS(entry_set)(writer, entry, target, critical_qword_index, 1);
>>+ NS(entry_set)(writer, entry, target, 0, writer->num_quantas);
>>+ } else if (used_qword_diff) {
>>+ /*
>>+ * At least two quantas need their inuse bits to be changed.
>>+ * This requires a breaking update, zero the V bit, write all
>>+ * qwords but 0, then set qword 0
>>+ */
>>+ unused_update[writer->vbit_quanta] = 0;
>>+ NS(entry_set)(writer, entry, unused_update, writer->vbit_quanta, 1);
>>+
>>+ if (writer->vbit_quanta != 0)
>>+ NS(entry_set)(writer, entry, target, 0,
>>+ writer->vbit_quanta - 1);
>
>Looking at the definition of the entry_set below, the last argument is
>length. So if vbit_quanta 1 then it would write zero len. Shouldn't it
>be writing quantas before the vbit_quanta?
>>+ if (writer->vbit_quanta != writer->num_quantas)
Looking at this again, I think vbit_quanta can never be equal to
num_quanta as num_quantas is length and vbit_quanta is index?
>>+ NS(entry_set)(writer, entry, target,
>>+ writer->vbit_quanta,
Staring from vbit_quanta will set the present bit if it is set in the
target?
>>+ writer->num_quantas - 1);
>
>Sami here, the last argument should not have "- 1".
I meant "Same here".
>>+
>>+ NS(entry_set)(writer, entry, target, writer->vbit_quanta, 1);
>>+ } else {
>>+ /*
>>+ * No inuse bit changed. Sanity check that all unused bits are 0
>>+ * in the entry. The target was already sanity checked by
>>+ * entry_quanta_diff().
>>+ */
>>+ WARN_ON_ONCE(NS(entry_set)(writer, entry, target, 0,
>>+ writer->num_quantas));
>>+ }
>>+}
>>+EXPORT_SYMBOL(NS(entry_sync_write));
>>+
>>+#undef entry_sync_writer
>>+#undef quanta_t
>>+#undef NS
>>diff --git a/drivers/iommu/entry_sync.c b/drivers/iommu/entry_sync.c
>>new file mode 100644
>>index 000000000000..48d31270dbba
>>--- /dev/null
>>+++ b/drivers/iommu/entry_sync.c
>>@@ -0,0 +1,68 @@
>>+// SPDX-License-Identifier: GPL-2.0-only
>>+/*
>>+ * Helpers for drivers to update multi-quanta entries shared with HW without
>>+ * races to minimize breaking changes.
>>+ */
>>+#include "entry_sync.h"
>>+#include <linux/kconfig.h>
>>+#include <linux/atomic.h>
>>+
>>+#if IS_ENABLED(CONFIG_IOMMU_ENTRY_SYNC64)
>>+static bool entry_set64(struct entry_sync_writer64 *writer, __le64 *entry,
>>+ const __le64 *target, unsigned int start,
>>+ unsigned int len)
>>+{
>>+ bool changed = false;
>>+ unsigned int i;
>>+
>>+ for (i = start; len != 0; len--, i++) {
>>+ if (entry[i] != target[i]) {
>>+ WRITE_ONCE(entry[i], target[i]);
>>+ changed = true;
>>+ }
>>+ }
>>+
>>+ if (changed)
>>+ writer->ops->sync(writer);
>>+ return changed;
>>+}
>>+
>>+#define entry_sync_writer entry_sync_writer64
>>+#define quanta_t __le64
>>+#define NS(name) CONCATENATE(name, 64)
>>+#include "entry_sync_template.h"
>>+#endif
>>+
>>+#if IS_ENABLED(CONFIG_IOMMU_ENTRY_SYNC128)
>>+static bool entry_set128(struct entry_sync_writer128 *writer, u128 *entry,
>>+ const u128 *target, unsigned int start,
>>+ unsigned int len)
>>+{
>>+ bool changed = false;
>>+ unsigned int i;
>>+
>>+ for (i = start; len != 0; len--, i++) {
>>+ if (entry[i] != target[i]) {
>>+ /*
>>+ * Use cmpxchg128 to generate an indivisible write from
>>+ * the CPU to DMA'able memory. This must ensure that HW
>>+ * sees either the new or old 128 bit value and not
>>+ * something torn. As updates are serialized by a
>>+ * spinlock, we use the local (unlocked) variant to
>>+ * avoid unnecessary bus locking overhead.
>>+ */
>>+ cmpxchg128_local(&entry[i], entry[i], target[i]);
>>+ changed = true;
>>+ }
>>+ }
>>+
>>+ if (changed)
>>+ writer->ops->sync(writer);
>>+ return changed;
>>+}
>>+
>>+#define entry_sync_writer entry_sync_writer128
>>+#define quanta_t u128
>>+#define NS(name) CONCATENATE(name, 128)
>>+#include "entry_sync_template.h"
>>+#endif
>>--
>>2.43.0
>>
next prev parent reply other threads:[~2026-03-10 0:06 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-09 6:06 [PATCH 0/8] iommu/vt-d: Hitless PASID updates via entry_sync Lu Baolu
2026-03-09 6:06 ` [PATCH 1/8] iommu: Lift and generalize the STE/CD update code from SMMUv3 Lu Baolu
2026-03-09 23:33 ` Samiullah Khawaja
2026-03-10 0:06 ` Samiullah Khawaja [this message]
2026-03-14 8:13 ` Baolu Lu
2026-03-16 9:51 ` Will Deacon
2026-03-18 3:10 ` Baolu Lu
2026-03-23 12:55 ` Jason Gunthorpe
2026-03-24 5:30 ` Baolu Lu
2026-03-16 16:35 ` Samiullah Khawaja
2026-03-18 3:23 ` Baolu Lu
2026-03-13 5:39 ` Nicolin Chen
2026-03-16 6:24 ` Baolu Lu
2026-03-23 12:59 ` Jason Gunthorpe
2026-03-24 5:49 ` Baolu Lu
2026-03-09 6:06 ` [PATCH 2/8] iommu/vt-d: Add entry_sync support for PASID entry updates Lu Baolu
2026-03-09 13:41 ` Jason Gunthorpe
2026-03-11 8:42 ` Baolu Lu
2026-03-11 12:23 ` Jason Gunthorpe
2026-03-12 7:51 ` Baolu Lu
2026-03-12 7:50 ` Baolu Lu
2026-03-12 11:44 ` Jason Gunthorpe
2026-03-15 8:11 ` Baolu Lu
2026-03-23 13:07 ` Jason Gunthorpe
2026-03-24 6:22 ` Baolu Lu
2026-03-24 12:53 ` Jason Gunthorpe
2026-03-09 6:06 ` [PATCH 3/8] iommu/vt-d: Require CMPXCHG16B for PASID support Lu Baolu
2026-03-09 13:42 ` Jason Gunthorpe
2026-03-12 7:59 ` Baolu Lu
2026-03-09 6:06 ` [PATCH 4/8] iommu/vt-d: Add trace events for PASID entry sync updates Lu Baolu
2026-03-09 6:06 ` [PATCH 5/8] iommu/vt-d: Use intel_pasid_write() for first-stage setup Lu Baolu
2026-03-09 6:06 ` [PATCH 6/8] iommu/vt-d: Use intel_pasid_write() for second-stage setup Lu Baolu
2026-03-09 6:06 ` [PATCH 7/8] iommu/vt-d: Use intel_pasid_write() for pass-through setup Lu Baolu
2026-03-09 6:06 ` [PATCH 8/8] iommu/vt-d: Use intel_pasid_write() for nested setup Lu Baolu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aa9fQchZJER8QvY0@google.com \
--to=skhawaja@google.com \
--cc=baolu.lu@linux.intel.com \
--cc=dmaluka@chromium.org \
--cc=iommu@lists.linux.dev \
--cc=jgg@nvidia.com \
--cc=joro@8bytes.org \
--cc=kevin.tian@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=robin.murphy@arm.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox