From: Michal Pecio <michal.pecio@gmail.com>
To: Desnes Nunes <desnesn@redhat.com>,
David Woodhouse <dwmw2@infradead.org>,
Lu Baolu <baolu.lu@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org,
gregkh@linuxfoundation.org, mathias.nyman@intel.com,
stable@vger.kernel.org, iommu@lists.linux.dev
Subject: Re: [PATCH RFT RFC] usb: xhci: Kill hosts with HCE or HSE on command timeout
Date: Wed, 27 May 2026 10:32:21 +0200
Message-ID: <20260527103221.7f8b15b0.michal.pecio@gmail.com>
In-Reply-To: <CACaw+ezMnQh2_oqbZ0jF99+wOADMU2vSMqxh9BoJoefjAC_ixw@mail.gmail.com>
Adding Intel IOMMU people.
Context:
Desnes reported xHCI issues during crash kernel boot after a
SysRq-triggered panic. It turns out that the chip gets an IOMMU fault,
and some other devices do too. The faulting address is a successful
dma_alloc_coherent() allocation made in xhci_alloc_erst(), and there is
no evidence that it is freed before the fault occurs. There are no
problems during normal boot.
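
For reference when reading the DMAR lines quoted below: the
"Request device [80:1f.6]" field is a PCI bus:device.function triple.
A minimal userspace sketch of decoding it follows. The names parse_bdf,
struct bdf and sm_upper_half are hypothetical helpers, not kernel API.
The bus number selects the root entry; the split of a scalable-mode
root entry into a lower half for devfn 0-127 and an upper half for
devfn 128-255 is taken here as an assumption from the VT-d
specification's root entry layout and should be checked against it.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type, not a kernel structure. */
struct bdf {
	uint8_t bus;	/* index into the 256-entry root table */
	uint8_t devfn;	/* (device << 3) | function */
};

/* Parse a string such as "80:1f.6". Returns 0 on success, -1 otherwise. */
static int parse_bdf(const char *s, struct bdf *out)
{
	unsigned int bus, dev, fn;
	char extra;

	/* Exactly three fields must convert; trailing characters are rejected. */
	if (sscanf(s, "%2x:%2x.%1x%c", &bus, &dev, &fn, &extra) != 3)
		return -1;
	if (dev > 0x1f || fn > 7)
		return -1;

	out->bus = (uint8_t)bus;
	out->devfn = (uint8_t)((dev << 3) | fn);
	return 0;
}

/* ASSUMPTION: in scalable mode, devfn 128-255 is covered by the upper half. */
static int sm_upper_half(const struct bdf *b)
{
	return b->devfn >= 128;
}

#ifdef BDF_DEMO_MAIN
int main(void)
{
	const char *reqs[] = { "80:1f.6", "80:14.0" };
	struct bdf b;

	for (size_t i = 0; i < sizeof(reqs) / sizeof(reqs[0]); i++) {
		if (parse_bdf(reqs[i], &b) == 0)
			printf("%s: bus 0x%02x devfn 0x%02x upper=%d\n",
			       reqs[i], b.bus, b.devfn, sm_upper_half(&b));
	}
	return 0;
}
#endif
```

For the two requesters in the log, 80:1f.6 decodes to devfn 0xfe and
80:14.0 to devfn 0xa0, both on bus 0x80. Whether the lower/upper split
matters for this bug is not established here.
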
On Wed, 27 May 2026 00:47:53 -0300, Desnes Nunes wrote:
> # grep "alloc ERST\|free ERST\|ERST\|Device context\|fault addr" kexec-dmesg.log
> [Tue May 26 08:41:56 2026] DMAR: [DMA Write NO_PASID] Request device [80:1f.6] fault addr 0x106f06000 [fault reason 0x39] SM: Present bit in Root Entry is clear
> [Tue May 26 08:41:56 2026] DMAR: [DMA Write NO_PASID] Request device [80:1f.6] fault addr 0x106f19000 [fault reason 0x39] SM: Present bit in Root Entry is clear
> [Tue May 26 08:41:57 2026] DMAR: [DMA Write NO_PASID] Request device [80:1f.6] fault addr 0x106f1c000 [fault reason 0x39] SM: Present bit in Root Entry is clear
> [...]
> [Tue May 26 08:42:01 2026] xhci_hcd 0000:80:14.0: alloc ERST at 0x0000001075140000
> [Tue May 26 08:42:01 2026] xhci_hcd 0000:80:14.0: ERST deq = 64'h107513e000
> [Tue May 26 08:42:02 2026] DMAR: [DMA Read NO_PASID] Request device [80:14.0] fault addr 0x1075140000 [fault reason 0x39] SM: Present bit in Root Entry is clear
>
> ^ PS: Different address alloc on kdump though
>
> > Otherwise, it seems you were right that you have some IOMMU problem.
>
> Thus, I started to investigate this front now. This time I gave some
> more attention to these dmar messages:
>
> [Tue May 19 08:17:49 2026] DMAR: Intel-IOMMU force enabled due to platform opt in
> [Tue May 19 08:17:49 2026] DMAR: No RMRR found
> [Tue May 19 08:17:49 2026] DMAR: No ATSR found
> [Tue May 19 08:17:49 2026] DMAR: dmar0: Using Queued invalidation
> => [Tue May 19 08:17:49 2026] DMAR: Translation already enabled - trying to copy translation structures
> => [Tue May 19 08:17:49 2026] DMAR: Copied translation tables from previous kernel for dmar0
> [Tue May 19 08:17:49 2026] DMAR: dmar1: Using Queued invalidation
> => [Tue May 19 08:17:49 2026] DMAR: Translation already enabled - trying to copy translation structures
> => [Tue May 19 08:17:49 2026] DMAR: Copied translation tables from previous kernel for dmar1
>
> I started wondering if maybe on my system these translation tables
> can't be fully trusted for some reason during kdump?
> Maybe iommu is copying root_entries with the Present bit clear, and
> thus generating the fault reason 0x39?
> -> bus 0x80's? Both ethernet and xhci_hcd fault addr were on this bus
>
> So, to test this theory out, I tried to disable translation and
> allocate a clean root-entry table right away if I am running a kdump
> kernel:
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index e236c7ec221f..de673f34f4e1 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -2135,24 +2135,31 @@ static int __init init_dmars(void)
>  		if (translation_pre_enabled(iommu)) {
>  			pr_info("Translation already enabled - trying to copy translation structures\n");
>
> -			ret = copy_translation_tables(iommu);
> -			if (ret) {
> -				/*
> -				 * We found the IOMMU with translation
> -				 * enabled - but failed to copy over the
> -				 * old root-entry table. Try to proceed
> -				 * by disabling translation now and
> -				 * allocating a clean root-entry table.
> -				 * This might cause DMAR faults, but
> -				 * probably the dump will still succeed.
> -				 */
> -				pr_err("Failed to copy translation tables from previous kernel for %s\n",
> -				       iommu->name);
> +			if (is_kdump_kernel()) {
> +				pr_info("DESNES V2 IOMMU kdump kernel, disabilng translation and allocating clean root-entry for %s\n",
> +					iommu->name);
>  				iommu_disable_translation(iommu);
>  				clear_translation_pre_enabled(iommu);
>  			} else {
> -				pr_info("Copied translation tables from previous kernel for %s\n",
> -					iommu->name);
> +				ret = copy_translation_tables(iommu);
> +				if (ret) {
> +					/*
> +					 * We found the IOMMU with translation
> +					 * enabled - but failed to copy over the
> +					 * old root-entry table. Try to proceed
> +					 * by disabling translation now and
> +					 * allocating a clean root-entry table.
> +					 * This might cause DMAR faults, but
> +					 * probably the dump will still succeed.
> +					 */
> +					pr_err("DESNES V2 Failed to copy translation tables from previous kernel for %s\n",
> +					       iommu->name);
> +					iommu_disable_translation(iommu);
> +					clear_translation_pre_enabled(iommu);
> +				} else {
> +					pr_info("DESNES V2 Copied translation tables from previous kernel for %s\n",
> +						iommu->name);
> +				}
>  			}
>  		}
>
> I didn't have time to check ERST or HSE yet, but with this I didn't get
> any DMAR faults, the vmcore was collected normally and the system
> rebooted smoothly afterwards:
>
> [Tue May 26 22:52:58 2026] DMAR: Intel-IOMMU force enabled due to platform opt in
> [Tue May 26 22:52:58 2026] DMAR: No RMRR found
> [Tue May 26 22:52:58 2026] DMAR: No ATSR found
> [Tue May 26 22:52:58 2026] DMAR: dmar0: Using Queued invalidation
> => [Tue May 26 22:52:58 2026] DMAR: Translation already enabled - trying to copy translation structures
> => [Tue May 26 22:52:58 2026] DMAR: DESNES V2 IOMMU kdump kernel, disabilng translation and allocating clean root-entry for dmar0
> [Tue May 26 22:52:58 2026] DMAR: dmar1: Using Queued invalidation
> => [Tue May 26 22:52:58 2026] DMAR: Translation already enabled - trying to copy translation structures
> => [Tue May 26 22:52:58 2026] DMAR: DESNES V2 IOMMU kdump kernel, disabilng translation and allocating clean root-entry for dmar1
>
> Seems like a lead on this iommu front.
>
> The funny thing is that the comment in this section literally says that
> doing this could cause faults, but here clearing it actually seemed to
> solve them and made kdump succeed - commit
> 091d42e43d21b6ca7ec39bf5f9e17bc0bd8d4312 ("iommu/vt-d: Copy
> translation tables from old kernel")
>
> Let me do some more tests to dump and check the root-entry table
> before clearing, as well as to check ERST allocations and HSE value,
> and I'll get back to you Michal.
>
> Best Regards,
>
> Desnes
>
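
To make the control flow of the experiment quoted above easier to
follow, here is a small userspace model of the decision it changes.
This is only a sketch: the enum and the function dmar_choose() are
invented for illustration, and the real code calls
copy_translation_tables(), iommu_disable_translation() and
clear_translation_pre_enabled() as shown in the diff.

```c
#include <stdbool.h>

/* Hypothetical names; this models the decision, it is not kernel code. */
enum dmar_action {
	DMAR_NOTHING_INHERITED,	/* translation was not pre-enabled */
	DMAR_KEEP_COPIED,	/* reuse tables copied from the old kernel */
	DMAR_DISABLE_AND_RESET,	/* disable translation, use a clean root table */
};

static enum dmar_action dmar_choose(bool pre_enabled, bool kdump,
				    bool copy_ok, bool patched)
{
	if (!pre_enabled)
		return DMAR_NOTHING_INHERITED;

	/* Experimental change: in a kdump kernel, skip the copy entirely. */
	if (patched && kdump)
		return DMAR_DISABLE_AND_RESET;

	/* Unpatched behaviour: reset only when the copy itself fails. */
	return copy_ok ? DMAR_KEEP_COPIED : DMAR_DISABLE_AND_RESET;
}
```

In the reported case the copy succeeded, so the unpatched path kept the
old tables and the faults appeared, while the patched path resets them
without attempting the copy.
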