From: Pranjal Shrivastava <praan@google.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: will@kernel.org, robin.murphy@arm.com, jgg@nvidia.com,
joro@8bytes.org, kees@kernel.org, baolu.lu@linux.intel.com,
kevin.tian@intel.com, miko.lenczewski@arm.com,
smostafa@google.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
stable@vger.kernel.org, jamien@nvidia.com
Subject: Re: [PATCH rc v6 1/7] iommu/arm-smmu-v3: Add arm_smmu_kdump_adopt_strtab() for kdump
Date: Sun, 28 Jun 2026 23:00:43 +0000 [thread overview]
Message-ID: <akGnm_dzvoapjwoI@google.com> (raw)
In-Reply-To: <f3d1f938a9b4d540f67d9c1ff394bd62735a4f5c.1779265413.git.nicolinc@nvidia.com>
On Wed, May 20, 2026 at 10:03:18AM -0700, Nicolin Chen wrote:
Hi Nicolin,
> When transitioning to a kdump kernel, the primary kernel might have crashed
> while endpoint devices were actively bus-mastering DMA. Currently, the SMMU
> driver aggressively resets the hardware during probe by clearing CR0_SMMUEN
> and setting the Global Bypass Attribute (GBPA) to ABORT.
>
> In a kdump scenario, this aggressive reset is highly destructive:
> a) If GBPA is set to ABORT, in-flight DMA will be aborted, generating fatal
> PCIe AER or SErrors that may panic the kdump kernel
> b) If GBPA is set to BYPASS, in-flight DMA targeting some IOVAs will bypass
> the SMMU and corrupt the physical memory at those 1:1 mapped IOVAs.
>
> To safely absorb in-flight DMAs, a kdump kernel will have to leave SMMUEN=1
> intact and avoid modifying STRTAB_BASE, allowing HW to continue translating
> in-flight DMAs reusing the crashed kernel's page tables until the endpoint
> device drivers probe and quiesce their respective hardware.
>
> However, the ARM SMMUv3 architecture specification states that updating the
> SMMU_STRTAB_BASE register while SMMUEN == 1 is UNPREDICTABLE or ignored.
>
> This leaves a kdump kernel no choice but to adopt the stream table from the
> crashed kernel.
>
> Introduce ARM_SMMU_OPT_KDUMP_ADOPT and adopt functions memremapping all the
> stream tables extracted from STRTAB_BASE and STRTAB_BASE_CFG.
>
> Note that the adoption of the crashed kernel's stream table follows certain
> strict rules, since the old stream table might be compromised. Thus, apply
> some basic validations against the values read from the registers. If tests
> fail, it means the stream table cannot be trusted, so toss it entirely. To
> avoid OOM due to a potentially corrupted stream table, the memremap for l2
> tables is done on the kdump kernel's demand.
>
> The new option will be set in a following change.
>
> Fixes: b63b3439b856 ("iommu/arm-smmu-v3: Abort all transactions if SMMU is enabled in kdump kernel")
> Cc: stable@vger.kernel.org # v6.12+
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 254 +++++++++++++++++++-
> 2 files changed, 252 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index ef42df4753ec4..cd60b692c3901 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -861,6 +861,7 @@ struct arm_smmu_device {
> #define ARM_SMMU_OPT_MSIPOLL (1 << 2)
> #define ARM_SMMU_OPT_CMDQ_FORCE_SYNC (1 << 3)
> #define ARM_SMMU_OPT_TEGRA241_CMDQV (1 << 4)
> +#define ARM_SMMU_OPT_KDUMP_ADOPT (1 << 5)
> u32 options;
>
> struct arm_smmu_cmdq cmdq;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index e8d7dbe495f03..aa6837a5daa88 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2040,16 +2040,70 @@ static void arm_smmu_init_initial_stes(struct arm_smmu_ste *strtab,
> }
> }
>
> +static int arm_smmu_kdump_adopt_l2_strtab(struct arm_smmu_device *smmu, u32 sid,
> + phys_addr_t base, u32 span,
> + struct arm_smmu_strtab_l2 **l2table)
> +{
> + struct arm_smmu_strtab_l2 *table;
> + size_t size;
> +
> + /*
> + * Only a coherent SMMU is supported at this moment. For a non-coherent
> + * SMMU that wants to support ARM_SMMU_OPT_KDUMP_ADOPT, try MEMREMAP_WC.
> + */
> + if (WARN_ON(!(smmu->features & ARM_SMMU_FEAT_COHERENCY)))
> + return -EOPNOTSUPP;
We already checked this in arm_smmu_kdump_adopt_strtab_2lvl() can it
change since?
> +
> + /*
> + * Retest the span in case the L1 descriptor has been overwritten since
> + * the adopt. Reject this master's insert; panic or SMMU-disable would
> + * either lose the vmcore or cascade aborts. Do not try to fix it, as it
> + * would break all other SIDs in the same bus (PCI case). The corruption
> + * blast radius is already bounded to that bus range.
> + */
> + if (span != STRTAB_SPLIT + 1) {
> + dev_err(smmu->dev,
> + "kdump: L1[%u] span %u changed since adopt (was %u)\n",
> + arm_smmu_strtab_l1_idx(sid), span, STRTAB_SPLIT + 1);
> + return -EINVAL;
> + }
> +
> + size = (1UL << (span - 1)) * sizeof(struct arm_smmu_ste);
> +
> + table = devm_memremap(smmu->dev, base, size, MEMREMAP_WB);
Why do we use devm_memremap() here but memremap() in the rest of the
functions (even for the L1 ptr)?
> + if (IS_ERR(table)) {
> + dev_err(smmu->dev,
> + "kdump: failed to adopt l2 stream table for SID %u\n",
> + sid);
> + return PTR_ERR(table);
> + }
> +
> + *l2table = table;
> + return 0;
> +}
> +
> static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
> {
> dma_addr_t l2ptr_dma;
> struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> struct arm_smmu_strtab_l2 **l2table;
> + u32 l1_idx = arm_smmu_strtab_l1_idx(sid);
>
> - l2table = &cfg->l2.l2ptrs[arm_smmu_strtab_l1_idx(sid)];
> + l2table = &cfg->l2.l2ptrs[l1_idx];
> if (*l2table)
> return 0;
>
> + /* Deferred adoption of the crashed kernel's L2 table */
> + if (smmu->options & ARM_SMMU_OPT_KDUMP_ADOPT) {
> + u64 l2ptr = le64_to_cpu(cfg->l2.l1tab[l1_idx].l2ptr);
> + phys_addr_t base = l2ptr & STRTAB_L1_DESC_L2PTR_MASK;
> + u32 span = FIELD_GET(STRTAB_L1_DESC_SPAN, l2ptr);
> +
> + if (span && base)
> + return arm_smmu_kdump_adopt_l2_strtab(smmu, sid, base,
> + span, l2table);
> + }
> +
> *l2table = dmam_alloc_coherent(smmu->dev, sizeof(**l2table),
> &l2ptr_dma, GFP_KERNEL);
> if (!*l2table) {
> @@ -2061,8 +2115,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
>
> arm_smmu_init_initial_stes((*l2table)->stes,
> ARRAY_SIZE((*l2table)->stes));
> - arm_smmu_write_strtab_l1_desc(&cfg->l2.l1tab[arm_smmu_strtab_l1_idx(sid)],
> - l2ptr_dma);
> + arm_smmu_write_strtab_l1_desc(&cfg->l2.l1tab[l1_idx], l2ptr_dma);
> return 0;
> }
>
> @@ -4556,10 +4609,204 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> return 0;
> }
>
> +static int arm_smmu_kdump_adopt_strtab_2lvl(struct arm_smmu_device *smmu,
> + u32 cfg_reg, phys_addr_t base)
> +{
> + u32 log2size = FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, cfg_reg);
> + u32 split = FIELD_GET(STRTAB_BASE_CFG_SPLIT, cfg_reg);
> + struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> + u32 num_l1_ents;
> + size_t size;
> + int i;
> +
> + /*
> + * Only a coherent SMMU is supported at this moment. For a non-coherent
> + * SMMU that wants to support ARM_SMMU_OPT_KDUMP_ADOPT, try MEMREMAP_WC.
> + */
> + if (WARN_ON(!(smmu->features & ARM_SMMU_FEAT_COHERENCY)))
> + return -EOPNOTSUPP;
> +
> + if (log2size < split || log2size > smmu->sid_bits) {
> + dev_err(smmu->dev, "kdump: log2size %u out of range [%u, %u]\n",
> + log2size, split, smmu->sid_bits);
> + return -EINVAL;
> + }
> + if (split != STRTAB_SPLIT) {
> + dev_err(smmu->dev,
> + "kdump: unsupported STRTAB_SPLIT %u (expected %u)\n",
> + split, STRTAB_SPLIT);
> + return -EINVAL;
> + }
> +
> + num_l1_ents = 1U << (log2size - split);
> + if (num_l1_ents > STRTAB_MAX_L1_ENTRIES) {
> + dev_err(smmu->dev, "kdump: l1 entries %u exceeds max %u\n",
> + num_l1_ents, STRTAB_MAX_L1_ENTRIES);
> + return -EINVAL;
> + }
> +
> + cfg->l2.num_l1_ents = num_l1_ents;
> +
> + size = num_l1_ents * sizeof(struct arm_smmu_strtab_l1);
> + cfg->l2.l1tab = memremap(base, size, MEMREMAP_WB);
> + if (!cfg->l2.l1tab)
> + return -ENOMEM;
Same here (as below, sorry reviewing it as the code flows), we should
populate cfg->l2.l1_dma here to be consistent.
> +
> + cfg->l2.l2ptrs =
> + kcalloc(num_l1_ents, sizeof(*cfg->l2.l2ptrs), GFP_KERNEL);
> + if (!cfg->l2.l2ptrs)
> + return -ENOMEM;
We shuold ummap cfg->l2.l1tab before returning -ENOMEM here
> +
> + for (i = 0; i < num_l1_ents; i++) {
> + u64 l2ptr = le64_to_cpu(cfg->l2.l1tab[i].l2ptr);
> + phys_addr_t l2_base = l2ptr & STRTAB_L1_DESC_L2PTR_MASK;
> + u32 span = FIELD_GET(STRTAB_L1_DESC_SPAN, l2ptr);
> +
> + if (!span || !l2_base)
> + continue;
> +
> + if (span != STRTAB_SPLIT + 1) {
> + dev_err(smmu->dev,
> + "kdump: L1[%u] unsupported span %u (vs %u)\n",
> + i, span, STRTAB_SPLIT + 1);
> + return -EINVAL;
We leak kcalloc'd mem (l2.l2ptrs) here, also we should unmap cfg->l2.l1tab
> + }
> +
> + /*
> + * If the crashed kernel's l1 descriptors are deeply corrupted,
> + * blindly memremapping every l2 table here could lead to OOM.
> + *
> + * Defer the l2 memremap to arm_smmu_init_l2_strtab(), so peak
> + * memory is bounded by the kdump kernel's actual demand.
> + */
> + }
> +
> + return 0;
> +}
> +
> +static int arm_smmu_kdump_adopt_strtab_linear(struct arm_smmu_device *smmu,
> + u32 cfg_reg, phys_addr_t base)
> +{
> + u32 log2size = FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, cfg_reg);
> + struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> + unsigned int max_log2size;
> + size_t size;
> +
> + /*
> + * Only a coherent SMMU is supported at this moment. For a non-coherent
> + * SMMU that wants to support ARM_SMMU_OPT_KDUMP_ADOPT, try MEMREMAP_WC.
> + */
> + if (WARN_ON(!(smmu->features & ARM_SMMU_FEAT_COHERENCY)))
> + return -EOPNOTSUPP;
> +
> + /* Cap the size at what the kdump kernel itself would have allocated */
> + if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
> + max_log2size =
> + ilog2(STRTAB_MAX_L1_ENTRIES * STRTAB_NUM_L2_STES);
Looks like we'd never hit this if condition because we'd never support a
"linear" strtab if the HW supports ARM_SMMU_FEAT_2_LVL_STRTAB. Please
see arm_smmu_init_strtab:
static int arm_smmu_init_strtab(struct arm_smmu_device *smmu)
{
int ret;
if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
ret = arm_smmu_init_strtab_2lvl(smmu);
else
ret = arm_smmu_init_strtab_linear(smmu);
if (ret)
return ret;
ida_init(&smmu->vmid_map);
return 0;
}
> + else
> + max_log2size = smmu->sid_bits;
> +
> + /* cfg->linear.num_ents is unsigned int, so cap log2size at 31 */
> + max_log2size = min(max_log2size, 31U);
> + if (log2size > max_log2size) {
> + dev_err(smmu->dev, "kdump: unsupported log2size %u (> %u)\n",
> + log2size, max_log2size);
> + return -EINVAL;
> + }
> +
> + /*
> + * We might end up with a num_ents != sid_bits, which is fine. In the
> + * ARM_SMMU_OPT_KDUMP_ADOPT case, arm_smmu_write_strtab() is bypassed.
> + */
> + cfg->linear.num_ents = 1U << log2size;
> +
> + size = cfg->linear.num_ents * sizeof(struct arm_smmu_ste);
> + cfg->linear.table = memremap(base, size, MEMREMAP_WB);
> + if (!cfg->linear.table)
> + return -ENOMEM;
We seem to skips initializing cfg->linear.ste_dma (it is populated in
arm_smmu_init_strtab_linear)
While the comment notes that arm_smmu_write_strtab() is bypassed in the
kdump case, leaving cfg->linear.ste_dma uninitialized seems like a
ticking time bomb if any other part of the driver ever uses it.
> + return 0;
> +}
> +
> +static void arm_smmu_kdump_adopt_cleanup(void *data)
> +{
> + struct arm_smmu_device *smmu = data;
> + u32 cfg_reg = readl_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
I'm worried about reading the HW register here, since this is a devres action, it
can run after arm_smmu_device_remove() or arm_smmu_device_shutdown()
(which would call rpm_put()). Please see __device_release_driver[1]:
pm_runtime_put_sync(dev); <--- HW turned off
device_remove(dev);
if (dev->bus && dev->bus->dma_cleanup)
dev->bus->dma_cleanup(dev);
device_unbind_cleanup(dev); <--- This is where devm_release runs
device_links_driver_cleanup(dev);
Thus, even if we call rpm_get() here it would make no sense as the
register contents would've been lost. Can we rely on some SW state
here? smmu->features & 2LVL or maybe add an fmt in cfg?
> + struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> + u32 fmt = FIELD_GET(STRTAB_BASE_CFG_FMT, cfg_reg);
> +
> + if (fmt == STRTAB_BASE_CFG_FMT_2LVL) {
> + kfree(cfg->l2.l2ptrs);
> + if (cfg->l2.l1tab)
> + memunmap(cfg->l2.l1tab);
> + } else if (fmt == STRTAB_BASE_CFG_FMT_LINEAR) {
> + if (cfg->linear.table)
> + memunmap(cfg->linear.table);
> + }
> +}
> +
> +static int arm_smmu_kdump_adopt_strtab(struct arm_smmu_device *smmu)
> +{
> + u32 cfg_reg = readl_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
> + u64 base_reg = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> + u32 fmt = FIELD_GET(STRTAB_BASE_CFG_FMT, cfg_reg);
> + phys_addr_t base = base_reg & STRTAB_BASE_ADDR_MASK;
> + int ret;
> +
> + dev_info(smmu->dev, "kdump: adopting crashed kernel's stream table\n");
Nit: Should this be dev_info? If everything goes right, the user doesn't
need to know. dev_dbg seems more appropriate. It should only be a warn or
err if adoption fails (which is in place).
> +
> + if (fmt == STRTAB_BASE_CFG_FMT_2LVL) {
> + /*
> + * Both kernels run on the same hardware, so it's impossible for
> + * kdump kernel to see the support for linear stream table only.
> + */
> + if (WARN_ON(!(smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)))
> + ret = -EINVAL;
> + else
> + ret = arm_smmu_kdump_adopt_strtab_2lvl(smmu, cfg_reg,
> + base);
> + } else if (fmt == STRTAB_BASE_CFG_FMT_LINEAR) {
> + /*
> + * In case that the old kernel for some reason used the linear
Nit: This sounds a little judgemental, what if the HW only supports
linear table? Let's drop the "for some reason" part.
> + * format, enforce the same format to match the adopted table.
> + */
> + ret = arm_smmu_kdump_adopt_strtab_linear(smmu, cfg_reg, base);
> + if (!ret)
> + smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
IIRC, this should NOT be set if we selected the linear format.
Looking at arm_smmu_init_strtab(), if the HW supports 2-level tables,
the driver unconditionally selects it. What is the expected scenario
where the previous kernel would have allocated a linear table on 2-level
capable hardware? IMO, it is a bug if we see linear fmt with this
feature set. Am I missing something?
> + } else {
> + dev_err(smmu->dev, "kdump: invalid STRTAB format %u\n", fmt);
> + ret = -EINVAL;
> + }
> +
> + if (ret) {
> + arm_smmu_kdump_adopt_cleanup(smmu);
> + goto err;
> + }
> +
> + ret = devm_add_action_or_reset(smmu->dev, arm_smmu_kdump_adopt_cleanup,
> + smmu);
> + /* devm_add_action_or_reset ran the cleanup upon failure */
> + if (ret) {
> + dev_warn(smmu->dev, "kdump: failed to set up cleanup action\n");
> + goto err;
> + }
> +
> + return 0;
> +
> +err:
> + dev_warn(smmu->dev, "kdump: falling back to full reset\n");
> + memset(&smmu->strtab_cfg, 0, sizeof(smmu->strtab_cfg));
> + smmu->options &= ~ARM_SMMU_OPT_KDUMP_ADOPT;
> + return ret;
> +}
> +
> static int arm_smmu_init_strtab(struct arm_smmu_device *smmu)
> {
> int ret;
>
> + if ((smmu->options & ARM_SMMU_OPT_KDUMP_ADOPT) &&
> + !arm_smmu_kdump_adopt_strtab(smmu))
> + goto out;
> +
> if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
> ret = arm_smmu_init_strtab_2lvl(smmu);
> else
> @@ -4567,6 +4814,7 @@ static int arm_smmu_init_strtab(struct arm_smmu_device *smmu)
> if (ret)
> return ret;
>
> +out:
> ida_init(&smmu->vmid_map);
>
> return 0;
> --
> 2.43.0
>
Thanks,
Praan
[1] https://elixir.bootlin.com/linux/v7.1.2/source/drivers/base/dd.c#L1350
next prev parent reply other threads:[~2026-06-28 23:00 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 17:03 [PATCH rc v6 0/7] iommu/arm-smmu-v3: Fix device crash on kdump kernel Nicolin Chen
2026-05-20 17:03 ` [PATCH rc v6 1/7] iommu/arm-smmu-v3: Add arm_smmu_kdump_adopt_strtab() for kdump Nicolin Chen
2026-06-28 23:00 ` Pranjal Shrivastava [this message]
2026-06-29 4:01 ` Nicolin Chen
2026-05-20 17:03 ` [PATCH rc v6 2/7] iommu/arm-smmu-v3: Implement is_attach_deferred() " Nicolin Chen
2026-06-28 23:06 ` Pranjal Shrivastava
2026-05-20 17:03 ` [PATCH rc v6 3/7] iommu/arm-smmu-v3: Do not enable EVTQ/PRIQ interrupts in kdump kernel Nicolin Chen
2026-05-20 17:03 ` [PATCH rc v6 4/7] iommu/arm-smmu-v3: Skip EVTQ/PRIQ setup " Nicolin Chen
2026-05-20 17:03 ` [PATCH rc v6 5/7] iommu/arm-smmu-v3: Retain CR0_SMMUEN during kdump device reset Nicolin Chen
2026-05-20 17:03 ` [PATCH rc v6 6/7] iommu/arm-smmu-v3: Skip RMR bypass for kdump adoption Nicolin Chen
2026-05-20 17:03 ` [PATCH rc v6 7/7] iommu/arm-smmu-v3: Detect ARM_SMMU_OPT_KDUMP_ADOPT in probe() Nicolin Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=akGnm_dzvoapjwoI@google.com \
--to=praan@google.com \
--cc=baolu.lu@linux.intel.com \
--cc=iommu@lists.linux.dev \
--cc=jamien@nvidia.com \
--cc=jgg@nvidia.com \
--cc=joro@8bytes.org \
--cc=kees@kernel.org \
--cc=kevin.tian@intel.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=miko.lenczewski@arm.com \
--cc=nicolinc@nvidia.com \
--cc=robin.murphy@arm.com \
--cc=smostafa@google.com \
--cc=stable@vger.kernel.org \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox