Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)

public inbox for linux-usb@vger.kernel.org

* Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)
       [not found] <b2abd254-d11f-4ef7-8664-b9e5a1409abc@panix.com>
@ 2025-02-10 21:05 ` Bjorn Helgaas
  2025-02-11  0:18   ` Kenneth Crudup
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2025-02-10 21:05 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan, linux-pci,
	linux-kernel, Niklāvs Koļesņikovs, Andreas Noever,
	Michael Jamet, Mika Westerberg, Yehezkel Bernat, linux-usb

[+cc Thunderbolt folks, original post
https://lore.kernel.org/r/04091f53-3c94-4533-ab48-e9296e6e2841@panix.com]

Wow, something about your platform or usage is really good at finding
these bugs ;)  The original post is a page fault in xe_display_pm_resume()
and this one a NULL pointer dereference in __tb_path_deactivate_hop().
You see these so frequently that I would think Google would find more
reports, but I don't see any, which makes me wonder if this is some
kind of memory corruption related to a driver on your system.

On Sat, Feb 08, 2025 at 07:47:10PM -0800, Kenneth Crudup wrote:
> Eh, NVM.
> 
> Just had another resume crash with it reverted:
> 
> ----
> <6>[69848.656985][T25549] CPU19 is up
> <6>[69848.666220][T25549] ACPI: PM: Waking up from system sleep state S4
> <6>[69848.693976][T25549] ACPI: EC: interrupt unblocked
> <4>[69848.704735][T25549] thunderbolt 0000:00:0d.2: 0:5: path does not end on a DP adapter, cleaning up
> <1>[69848.706322][T25549] BUG: kernel NULL pointer dereference, address: 0000000000000384
> <1>[69848.706324][T25549] #PF: supervisor read access in kernel mode
> <1>[69848.706325][T25549] #PF: error_code(0x0000) - not-present page
> <6>[69848.706326][T25549] PGD 0 P4D 0
> <4>[69848.706327][T25549] Oops: Oops: 0000 [#1] PREEMPT SMP
> <4>[69848.706330][T25549] CPU: 1 UID: 0 PID: 25549 Comm: systemd-sleep Tainted: G S   U     O       6.14.0-rc1-kenny+ #4
> <4>[69848.706332][T25549] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER, [O]=OOT_MODULE
> <4>[69848.706332][T25549] Hardware name: Dell Inc. XPS 9320/0KNXGD, BIOS 2.18.1 12/24/2024
> <4>[69848.706333][T25549] RIP: 0010:__tb_path_deactivate_hop+0x25/0x220
> <4>[69848.706337][T25549] Code: 5d 5d c3 c3 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 18 89 55 c4 65 48 8b 04 25 28 00 00 00 48 89 45 d0 48 8b 47 20 <80> b8 84 03 00 00 00 0f 85 64 01 00 00 8b 90 04 03 00 00 44 8d 2c
> <4>[69848.706338][T25549] RSP: 0000:ffffa2d40a9fb7f8 EFLAGS: 00010292
> <4>[69848.706339][T25549] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
> <4>[69848.706340][T25549] RDX: 000000000000000e RSI: 00000000a863e150 RDI: ffffa2d400803b00
> <4>[69848.706340][T25549] RBP: ffffa2d40a9fb838 R08: 0000000000000000 R09: ffffffffa9a55760
> <4>[69848.706341][T25549] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d1fc6bce7c0
> <4>[69848.706341][T25549] R13: 0000000000000028 R14: ffff8d1fc1861000 R15: ffff8d1fc18513e8
> <4>[69848.706342][T25549] FS:  00007fb5b3631940(0000) GS:ffff8d272f440000(0000) knlGS:0000000000000000
> <4>[69848.706343][T25549] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[69848.706343][T25549] CR2: 0000000000000384 CR3: 00000003cea00003 CR4: 0000000000770ef0
> <4>[69848.706344][T25549] PKRU: 55555554
> <4>[69848.706344][T25549] Call Trace:
> <4>[69848.706345][T25549]  <TASK>
> <4>[69848.706348][T25549]  ? show_regs.part.0+0x1d/0x20
> <4>[69848.706351][T25549]  ? __die+0x52/0x91
> <4>[69848.706352][T25549]  ? page_fault_oops+0x9a/0x220
> <4>[69848.706354][T25549]  ? exc_page_fault+0x2fc/0x5c0
> <4>[69848.706357][T25549]  ? asm_exc_page_fault+0x27/0x30
> <4>[69848.706359][T25549]  ? __tb_path_deactivate_hop+0x25/0x220
> <4>[69848.706360][T25549]  __tb_path_deactivate_hops+0x37/0x60
> <4>[69848.706361][T25549]  tb_path_deactivate+0x1e/0x110
> <4>[69848.706362][T25549]  tb_tunnel_deactivate+0x65/0x120
> <4>[69848.706363][T25549]  tb_tunnel_discover_dp+0x373/0x670
> <4>[69848.706364][T25549]  tb_switch_discover_tunnels+0x71/0x1e0
> <4>[69848.706366][T25549]  tb_resume_noirq+0x91/0x2a0
> <4>[69848.706368][T25549]  tb_domain_resume_noirq+0x3f/0x60
> <4>[69848.706369][T25549]  nhi_resume_noirq+0x34/0x90
> <4>[69848.706370][T25549]  pci_pm_restore_noirq+0x71/0xc0
> <4>[69848.706372][T25549]  ? new_id_store+0x1b0/0x1b0
> <4>[69848.706373][T25549]  dpm_run_callback+0x40/0xb0
> <4>[69848.706375][T25549]  device_resume_noirq+0xc4/0x2a0
> <4>[69848.706376][T25549]  dpm_noirq_resume_devices+0x11b/0x150
> <4>[69848.706376][T25549]  dpm_resume_start+0xc/0x30
> <4>[69848.706377][T25549]  hibernation_snapshot+0x26d/0x430
> <4>[69848.706379][T25549]  hibernate.cold+0x9c/0x333
> <4>[69848.706380][T25549]  state_store+0xbe/0xc0
> <4>[69848.706381][T25549]  kobj_attr_store+0xf/0x20
> <4>[69848.706383][T25549]  sysfs_kf_write+0x34/0x40
> <4>[69848.706385][T25549]  kernfs_fop_write_iter+0x134/0x1e0
> <4>[69848.706386][T25549]  vfs_write+0x244/0x410
> <4>[69848.706388][T25549]  ksys_write+0x63/0xd0
> <4>[69848.706389][T25549]  __x64_sys_write+0x14/0x20
> <4>[69848.706390][T25549]  x64_sys_call+0x9eb/0xa00
> <4>[69848.706392][T25549]  do_syscall_64+0x63/0xf0
> <4>[69848.706394][T25549]  ? do_filp_open+0xbe/0x170
> <4>[69848.706395][T25549]  ? do_wp_page+0x7f3/0xe80
> <4>[69848.706398][T25549]  ? ___pte_offset_map+0x17/0xe0
> <4>[69848.706399][T25549]  ? __handle_mm_fault+0xb13/0x1160
> <4>[69848.706400][T25549]  ? do_syscall_64+0x6f/0xf0
> <4>[69848.706401][T25549]  ? strncpy_from_user+0x25/0xf0
> <4>[69848.706402][T25549]  ? __count_memcg_events+0x49/0xe0
> <4>[69848.706403][T25549]  ? handle_mm_fault+0x181/0x2a0
> <4>[69848.706404][T25549]  ? irqentry_exit+0x4a/0x60
> <4>[69848.706405][T25549]  ? exc_page_fault+0x196/0x5c0
> <4>[69848.706406][T25549]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> <4>[69848.706407][T25549] RIP: 0033:0x7fb5b3526274
> <4>[69848.706411][T25549] Code: Unable to access opcode bytes at 0x7fb5b352624a.
> <4>[69848.706411][T25549] RSP: 002b:00007ffe667063f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> <4>[69848.706412][T25549] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb5b3526274
> <4>[69848.706413][T25549] RDX: 0000000000000005 RSI: 000055ae2380a030 RDI: 0000000000000007
> <4>[69848.706413][T25549] RBP: 00007ffe66706420 R08: 0000000000000000 R09: 0000000000000001
> <4>[69848.706414][T25549] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000005
> <4>[69848.706414][T25549] R13: 000055ae2380a030 R14: 000055ae237f62a0 R15: 00007fb5b360fea0
> <4>[69848.706415][T25549]  </TASK>
> <4>[69848.706415][T25549] Modules linked in: vmw_vmci snd_soc_sof_sdw snd_soc_sdw_utils snd_sof_probes iwlmvm mei_hdcp mei_pxp mac80211 snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_sof_pci soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_common snd_soc_hdac_hda iwlwifi btusb snd_sof_intel_hda_mlink btintel snd_sof_intel_hda cfg80211 mei_me ov01a10 xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec i915 drm_buddy intel_gtt drm_display_helper cec ttm
> <4>[69848.706433][T25549] CR2: 0000000000000384
> <4>[69848.706435][T25549] ---[ end trace 0000000000000000 ]
> ----
> 
> On 2/8/25 12:56, Kenneth Crudup wrote:
> > 
> > Guys, I don't think this commit is right; two out of three of my resumes
> > have failed since this change went into Linus' master. I've attached a
> > pstore dump of the latest crash, and while it appears to be coming from
> > the Intel XE driver, 95% of my (s0ix) resumes worked[1] before this
> > change.
> > 
> > LMK if you need more information.
> > 
> > -Kenny
> > 
> > [1] - unless I forget to detach my NVMe USB4 external drive before
> > suspending, which is a breakage that appears to have gone in sometime
> > around the 6.10 series, but I haven't been able to bisect it
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA
> 
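
As a cross-check on the oops above, here is a minimal sketch that decodes
the faulting instruction from the "Code:" line and compares it with the
register dump. It only handles the one encoding seen here (opcode 80 /7
with a 32-bit displacement, i.e. cmpb $imm8, disp32(%reg)); the helper
names are made up for illustration. The four bytes just before the marker
(48 8b 47 20) load RAX from 0x20(%rdi), so under this reading a pointer
fetched from the first argument was NULL and the fault is a byte read at
offset 0x384 from it. Which structure member that is cannot be told from
the bytes alone.

```python
# Bytes copied from the "Code:" line of the oops; the byte in angle
# brackets marks the start of the faulting instruction.
CODE = (
    "5d 5d c3 c3 90 55 48 89 e5 41 57 41 56 41 55 "
    "41 54 53 48 83 ec 18 89 55 c4 65 48 8b 04 25 28 00 00 00 "
    "48 89 45 d0 48 8b 47 20 <80> b8 84 03 00 00 00 0f 85 64 01 "
    "00 00 8b 90 04 03 00 00 44 8d 2c"
)


def faulting_bytes(code):
    """Return the bytes from the <..>-marked instruction onward."""
    tokens = code.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_cmpb_disp32(insn):
    """Decode `80 /7 ib` with a mod=10 ModRM byte and no SIB byte.

    Returns (base register number, signed 32-bit displacement).
    Hypothetical helper: it rejects every other encoding.
    """
    opcode, modrm = insn[0], insn[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x80 or reg != 7 or mod != 2 or rm == 4:
        raise ValueError("not the simple cmpb $imm8, disp32(%reg) form")
    disp = int.from_bytes(bytes(insn[2:6]), "little", signed=True)
    return rm, disp


base, disp = decode_cmpb_disp32(faulting_bytes(CODE))
rax = 0x0    # RAX from the register dump (register number 0)
cr2 = 0x384  # CR2 from the register dump
print(f"base register #{base}, displacement {disp:#x}")
print("matches CR2:", rax + disp == cr2)
```

Running the bytes through scripts/decodecode from the kernel tree would
give a full disassembly; this sketch only confirms that RAX + 0x384 is
the address reported in CR2.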

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)
  2025-02-10 21:05 ` PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac) Bjorn Helgaas
@ 2025-02-11  0:18   ` Kenneth Crudup
  2025-02-11  5:57     ` Mika Westerberg
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-11  0:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan, linux-pci,
	linux-kernel, Niklāvs Koļesņikovs, Andreas Noever,
	Michael Jamet, Mika Westerberg, Yehezkel Bernat, linux-usb,
	Kenneth Crudup


I wonder about that, too, but my resumes from hibernate/suspend work
most of the time.

I've instrumented .../drivers/thunderbolt/path.c, adding some pr_info()s
to __tb_path_deactivate_hop() to see whether it's tripping up on a NULL
pointer and where it dies, but nothing has happened since, so I do
sometimes wonder if it's an errant driver (i.e., subtle changes in the
ordering of code could be making a difference).

That being said, leaving a USB4-to-NVMe bridged NVMe device connected to 
a Thunderbolt dock on suspend causes hangs on resume that only the power 
button seems to recover from. I've really been trying to bisect the 
cause on this one as it's usually my only failure mode (and frustrating 
when I forget to unplug before I suspend). What complicates that one 
more is it now doesn't happen EVERY time.

One change I've made on top of Linus' master (which I tend to grab every
few days), and have carried for a while, is this:

----
-        cflags-$(CONFIG_MCORE2)                += -march=core2
+        cflags-$(CONFIG_MCORE2) += \
+                $(call cc-option,-march=alderlake,$(call cc-option,-mtune=native)) \
+                $(call cc-option,-mtune=alderlake,$(call cc-option,-mtune=native))
----

But I'm running "gcc (Ubuntu 14.2.0-4ubuntu2) 14.2.0", so that should be 
mature enough to not introduce any Intel-chipset-specific bugs, right?
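
To make explicit which flags the nested cc-option calls above end up
selecting, here is a simplified model. It assumes cc-option yields its
first argument when the compiler accepts that flag and its second
argument otherwise; real Kbuild decides this by test-compiling with
$(CC), whereas the sketch takes a hypothetical accepts() predicate. The
two flag sets are made-up stand-ins for a newer and an older compiler.

```python
def cc_option(accepts, flag, fallback=""):
    """Simplified model: `flag` if the compiler accepts it, else `fallback`."""
    return flag if flag and accepts(flag) else fallback


def mcore2_cflags(accepts):
    """Flags produced by the patched cflags-$(CONFIG_MCORE2) line."""
    inner = cc_option(accepts, "-mtune=native")  # shared inner fallback
    flags = (
        cc_option(accepts, "-march=alderlake", inner),
        cc_option(accepts, "-mtune=alderlake", inner),
    )
    return " ".join(f for f in flags if f)


# Hypothetical compilers, described only by the flags they accept.
newer = {"-march=alderlake", "-mtune=alderlake", "-mtune=native"}.__contains__
older = {"-mtune=native"}.__contains__

print(mcore2_cflags(newer))  # -march=alderlake -mtune=alderlake
print(mcore2_cflags(older))  # -mtune=native -mtune=native
```

Under this model, a compiler that knows alderlake builds the kernel with
-march=alderlake instead of -march=core2, so it may emit instructions
that a core2 baseline build would never contain.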

-Kenny


-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA



* Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)
  2025-02-11  0:18   ` Kenneth Crudup
@ 2025-02-11  5:57     ` Mika Westerberg
  2025-02-11  6:17       ` diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)) Kenneth Crudup
  0 siblings, 1 reply; 28+ messages in thread
From: Mika Westerberg @ 2025-02-11  5:57 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

Hi,

Can you elaborate on the steps you take to reproduce the issue? It would
also help to see the full dmesg, if possible, with "thunderbolt.dyndbg=+p"
on the kernel command line.

Based on the below, I suspect that the USB4 link just goes down and does
not come back up upon resume. What kind of devices are involved?

Actually, it would be good if you could create an entry on
bugzilla.kernel.org and attach the files there.
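
Before collecting the dmesg, it may be worth confirming that the
parameter actually reached the running kernel. A minimal sketch, assuming
the boot command line is readable from /proc/cmdline; the helper names
are hypothetical, and quoted values containing spaces are not handled:

```python
from pathlib import Path


def cmdline_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():  # simplification: ignores quoting
        name, _, value = token.partition("=")
        params[name] = value
    return params


def thunderbolt_debug_enabled(cmdline):
    """True if thunderbolt.dyndbg=+p is present on the given command line."""
    return cmdline_params(cmdline).get("thunderbolt.dyndbg") == "+p"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print(thunderbolt_debug_enabled(proc_cmdline.read_text()))
```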

On Mon, Feb 10, 2025 at 04:18:51PM -0800, Kenneth Crudup wrote:
> 
> I wonder about that, too, but my resumes from hibernate/suspend work
> most of the time.
> 
> I've instrumented .../drivers/thunderbolt/path.c, adding some pr_info()s
> to __tb_path_deactivate_hop() to see whether it's tripping up on a NULL
> pointer and where it dies, but nothing has happened since, so I do
> sometimes wonder if it's an errant driver (i.e., subtle changes in the
> ordering of code could be making a difference).
> 
> That being said, leaving a USB4-to-NVMe bridged NVMe device connected to a
> Thunderbolt dock on suspend causes hangs on resume that only the power
> button seems to recover from. I've really been trying to bisect the cause on
> this one as it's usually my only failure mode (and frustrating when I forget
> to unplug before I suspend). What complicates that one more is it now
> doesn't happen EVERY time.
> 
> One change I've made on top of Linus' master (which I tend to grab every
> few days), and have carried for a while, is this:
> 
> ----
> -        cflags-$(CONFIG_MCORE2)                += -march=core2
> +        cflags-$(CONFIG_MCORE2) += \
> +                $(call cc-option,-march=alderlake,$(call cc-option,-mtune=native)) \
> +                $(call cc-option,-mtune=alderlake,$(call cc-option,-mtune=native))
> ----
> 
> But I'm running "gcc (Ubuntu 14.2.0-4ubuntu2) 14.2.0", so that should be
> mature enough to not introduce any Intel-chipset-specific bugs, right?
> 
> -Kenny
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-11  5:57     ` Mika Westerberg
@ 2025-02-11  6:17       ` Kenneth Crudup
  2025-02-13 13:59         ` Mika Westerberg
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-11  6:17 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb,
	Kenneth Crudup


The setup is fairly simple (once I'd figured out the failure mode):

- Have an ASMedia 246x NVMe-to-USB4 housing (with NVMe drive) attached
to the system via my TB4 dock when I suspend (CalDigit TS4, but I've had
it happen with a Dell dock as well), with the drive either mounted or not.

- Resume with the drive disconnected (i.e., I've gone from home to the 
office).

It doesn't happen every time, and for some crazy reason elapsed time 
between suspend and resume seems to make it more likely to happen. Plus 
it seems directly attaching the drive (i.e., no dock in between) doesn't 
cause resumes to fail.

Unfortunately there's no "this always happens" failure mode, else I'd've 
had a bisection by now (it's been happening for a few months).
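
For an intermittent failure like this, a bisection step can only call a
kernel "good" after enough clean resumes. A rough sketch of how many,
assuming each suspend/resume cycle fails independently with probability
p_fail on a bad kernel. That independence is a simplification, since
elapsed time seems to matter here, and the rates below are hypothetical
rather than measured for the dock case:

```python
import math


def resumes_needed(p_fail, alpha=0.05):
    """Smallest n such that a bad kernel survives n clean resumes with
    probability at most alpha, i.e. (1 - p_fail) ** n <= alpha."""
    if not 0 < p_fail < 1 or not 0 < alpha < 1:
        raise ValueError("p_fail and alpha must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1 - p_fail))


# Hypothetical per-resume failure rates on a bad kernel.
for p in (2 / 3, 0.25, 0.05):
    print(f"p_fail={p:.2f}: {resumes_needed(p)} clean resumes")
```

At a two-in-three failure rate three clean resumes are enough, but at 5%
each bisection step needs 59, which is why a failure this rare is so
hard to bisect.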

(... and although I doubt it'll make any difference, I built and am now
running a kernel without the Alderlake optimizations/directives; maybe 
there's something to the "errant memory" thing.)

But I'll run the steps below and post the results (and create a
Bugzilla entry).

-K


On 2/10/25 21:57, Mika Westerberg wrote:
> Hi,
> 
> Can you elaborate on the steps you take to reproduce the issue? It would
> also help to see the full dmesg, if possible, with "thunderbolt.dyndbg=+p"
> on the kernel command line.
> 
> Based on the below, I suspect that the USB4 link just goes down and does
> not come back up upon resume. What kind of devices are involved?
> 
> Actually, it would be good if you could create an entry on
> bugzilla.kernel.org and attach the files there.
> 
> On Mon, Feb 10, 2025 at 04:18:51PM -0800, Kenneth Crudup wrote:
>>
>> I wonder about that, too, but my resumes from hibernate/suspend work
>> most of the time.
>>
>> I've instrumented .../drivers/thunderbolt/path.c, adding some pr_info()s
>> to __tb_path_deactivate_hop() to see whether it's tripping up on a NULL
>> pointer and where it dies, but nothing has happened since, so I do
>> sometimes wonder if it's an errant driver (i.e., subtle changes in the
>> ordering of code could be making a difference).
>>
>> That being said, leaving a USB4-to-NVMe bridged NVMe device connected to a
>> Thunderbolt dock on suspend causes hangs on resume that only the power
>> button seems to recover from. I've really been trying to bisect the cause on
>> this one as it's usually my only failure mode (and frustrating when I forget
>> to unplug before I suspend). What complicates that one more is it now
>> doesn't happen EVERY time.
>>
>> One change I've made on top of Linus' master (which I tend to grab every
>> few days), and have carried for a while, is this:
>>
>> ----
>> -        cflags-$(CONFIG_MCORE2)                += -march=core2
>> +        cflags-$(CONFIG_MCORE2) += \
>> +                $(call cc-option,-march=alderlake,$(call cc-option,-mtune=native)) \
>> +                $(call cc-option,-mtune=alderlake,$(call cc-option,-mtune=native))
>> ----
>>
>> But I'm running "gcc (Ubuntu 14.2.0-4ubuntu2) 14.2.0", so that should be
>> mature enough to not introduce any Intel-chipset-specific bugs, right?
>>
>> -Kenny
>>
>> On 2/10/25 13:05, Bjorn Helgaas wrote:
>>> [+cc Thunderbolt folks, original post
>>> https://lore.kernel.org/r/04091f53-3c94-4533-ab48-e9296e6e2841@panix.com]
>>>
>>> Wow, something about your platform or usage is really good at finding
>>> these bugs ;)  The original post is a page fault in xe_display_pm_resume()
>>> and this one a NULL pointer dereference in __tb_path_deactivate_hop().
>>> You see these so frequently that I would think Google would find more
>>> reports, but I don't see any, which makes me wonder if this is some
>>> kind of memory corruption related to a driver on your system.
>>>
>>> On Sat, Feb 08, 2025 at 07:47:10PM -0800, Kenneth Crudup wrote:
>>>> Eh, NVM.
>>>>
>>>> Just had another resume crash with it reverted:
>>>>
>>>> ----
>>>> <6>[69848.656985][T25549] CPU19 is up
>>>> <6>[69848.666220][T25549] ACPI: PM: Waking up from system sleep state S4
>>>> <6>[69848.693976][T25549] ACPI: EC: interrupt unblocked
>>>> <4>[69848.704735][T25549] thunderbolt 0000:00:0d.2: 0:5: path does not end
>>>> on a DP adapter, cleaning up
>>>> <1>[69848.706322][T25549] BUG: kernel NULL pointer dereference, address:
>>>> 0000000000000384
>>>> <1>[69848.706324][T25549] #PF: supervisor read access in kernel mode
>>>> <1>[69848.706325][T25549] #PF: error_code(0x0000) - not-present page
>>>> <6>[69848.706326][T25549] PGD 0 P4D 0
>>>> <4>[69848.706327][T25549] Oops: Oops: 0000 [#1] PREEMPT SMP
>>>> <4>[69848.706330][T25549] CPU: 1 UID: 0 PID: 25549 Comm: systemd-sleep
>>>> Tainted: G S   U     O       6.14.0-rc1-kenny+ #4
>>>> <4>[69848.706332][T25549] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER,
>>>> [O]=OOT_MODULE
>>>> <4>[69848.706332][T25549] Hardware name: Dell Inc. XPS 9320/0KNXGD, BIOS
>>>> 2.18.1 12/24/2024
>>>> <4>[69848.706333][T25549] RIP: 0010:__tb_path_deactivate_hop+0x25/0x220
>>>> <4>[69848.706337][T25549] Code: 5d 5d c3 c3 90 55 48 89 e5 41 57 41 56 41 55
>>>> 41 54 53 48 83 ec 18 89 55 c4 65 48 8b 04 25 28 00 00 00 48 89 45 d0 48 8b
>>>> 47 20 <80> b8 84 03 00 00 00 0f 85 64 01 00 00 8b 90 04 03 00 00 44 8d 2c
>>>> <4>[69848.706338][T25549] RSP: 0000:ffffa2d40a9fb7f8 EFLAGS: 00010292
>>>> <4>[69848.706339][T25549] RAX: 0000000000000000 RBX: 0000000000000001 RCX:
>>>> 0000000000000002
>>>> <4>[69848.706340][T25549] RDX: 000000000000000e RSI: 00000000a863e150 RDI:
>>>> ffffa2d400803b00
>>>> <4>[69848.706340][T25549] RBP: ffffa2d40a9fb838 R08: 0000000000000000 R09:
>>>> ffffffffa9a55760
>>>> <4>[69848.706341][T25549] R10: 0000000000000000 R11: 0000000000000000 R12:
>>>> ffff8d1fc6bce7c0
>>>> <4>[69848.706341][T25549] R13: 0000000000000028 R14: ffff8d1fc1861000 R15:
>>>> ffff8d1fc18513e8
>>>> <4>[69848.706342][T25549] FS:  00007fb5b3631940(0000)
>>>> GS:ffff8d272f440000(0000) knlGS:0000000000000000
>>>> <4>[69848.706343][T25549] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> <4>[69848.706343][T25549] CR2: 0000000000000384 CR3: 00000003cea00003 CR4:
>>>> 0000000000770ef0
>>>> <4>[69848.706344][T25549] PKRU: 55555554
>>>> <4>[69848.706344][T25549] Call Trace:
>>>> <4>[69848.706345][T25549]  <TASK>
>>>> <4>[69848.706348][T25549]  ? show_regs.part.0+0x1d/0x20
>>>> <4>[69848.706351][T25549]  ? __die+0x52/0x91
>>>> <4>[69848.706352][T25549]  ? page_fault_oops+0x9a/0x220
>>>> <4>[69848.706354][T25549]  ? exc_page_fault+0x2fc/0x5c0
>>>> <4>[69848.706357][T25549]  ? asm_exc_page_fault+0x27/0x30
>>>> <4>[69848.706359][T25549]  ? __tb_path_deactivate_hop+0x25/0x220
>>>> <4>[69848.706360][T25549]  __tb_path_deactivate_hops+0x37/0x60
>>>> <4>[69848.706361][T25549]  tb_path_deactivate+0x1e/0x110
>>>> <4>[69848.706362][T25549]  tb_tunnel_deactivate+0x65/0x120
>>>> <4>[69848.706363][T25549]  tb_tunnel_discover_dp+0x373/0x670
>>>> <4>[69848.706364][T25549]  tb_switch_discover_tunnels+0x71/0x1e0
>>>> <4>[69848.706366][T25549]  tb_resume_noirq+0x91/0x2a0
>>>> <4>[69848.706368][T25549]  tb_domain_resume_noirq+0x3f/0x60
>>>> <4>[69848.706369][T25549]  nhi_resume_noirq+0x34/0x90
>>>> <4>[69848.706370][T25549]  pci_pm_restore_noirq+0x71/0xc0
>>>> <4>[69848.706372][T25549]  ? new_id_store+0x1b0/0x1b0
>>>> <4>[69848.706373][T25549]  dpm_run_callback+0x40/0xb0
>>>> <4>[69848.706375][T25549]  device_resume_noirq+0xc4/0x2a0
>>>> <4>[69848.706376][T25549]  dpm_noirq_resume_devices+0x11b/0x150
>>>> <4>[69848.706376][T25549]  dpm_resume_start+0xc/0x30
>>>> <4>[69848.706377][T25549]  hibernation_snapshot+0x26d/0x430
>>>> <4>[69848.706379][T25549]  hibernate.cold+0x9c/0x333
>>>> <4>[69848.706380][T25549]  state_store+0xbe/0xc0
>>>> <4>[69848.706381][T25549]  kobj_attr_store+0xf/0x20
>>>> <4>[69848.706383][T25549]  sysfs_kf_write+0x34/0x40
>>>> <4>[69848.706385][T25549]  kernfs_fop_write_iter+0x134/0x1e0
>>>> <4>[69848.706386][T25549]  vfs_write+0x244/0x410
>>>> <4>[69848.706388][T25549]  ksys_write+0x63/0xd0
>>>> <4>[69848.706389][T25549]  __x64_sys_write+0x14/0x20
>>>> <4>[69848.706390][T25549]  x64_sys_call+0x9eb/0xa00
>>>> <4>[69848.706392][T25549]  do_syscall_64+0x63/0xf0
>>>> <4>[69848.706394][T25549]  ? do_filp_open+0xbe/0x170
>>>> <4>[69848.706395][T25549]  ? do_wp_page+0x7f3/0xe80
>>>> <4>[69848.706398][T25549]  ? ___pte_offset_map+0x17/0xe0
>>>> <4>[69848.706399][T25549]  ? __handle_mm_fault+0xb13/0x1160
>>>> <4>[69848.706400][T25549]  ? do_syscall_64+0x6f/0xf0
>>>> <4>[69848.706401][T25549]  ? strncpy_from_user+0x25/0xf0
>>>> <4>[69848.706402][T25549]  ? __count_memcg_events+0x49/0xe0
>>>> <4>[69848.706403][T25549]  ? handle_mm_fault+0x181/0x2a0
>>>> <4>[69848.706404][T25549]  ? irqentry_exit+0x4a/0x60
>>>> <4>[69848.706405][T25549]  ? exc_page_fault+0x196/0x5c0
>>>> <4>[69848.706406][T25549]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>>> <4>[69848.706407][T25549] RIP: 0033:0x7fb5b3526274
>>>> <4>[69848.706411][T25549] Code: Unable to access opcode bytes at
>>>> 0x7fb5b352624a.
>>>> <4>[69848.706411][T25549] RSP: 002b:00007ffe667063f8 EFLAGS: 00000202
>>>> ORIG_RAX: 0000000000000001
>>>> <4>[69848.706412][T25549] RAX: ffffffffffffffda RBX: 0000000000000005 RCX:
>>>> 00007fb5b3526274
>>>> <4>[69848.706413][T25549] RDX: 0000000000000005 RSI: 000055ae2380a030 RDI:
>>>> 0000000000000007
>>>> <4>[69848.706413][T25549] RBP: 00007ffe66706420 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> <4>[69848.706414][T25549] R10: 0000000000000003 R11: 0000000000000202 R12:
>>>> 0000000000000005
>>>> <4>[69848.706414][T25549] R13: 000055ae2380a030 R14: 000055ae237f62a0 R15:
>>>> 00007fb5b360fea0
>>>> <4>[69848.706415][T25549]  </TASK>
>>>> <4>[69848.706415][T25549] Modules linked in: vmw_vmci snd_soc_sof_sdw
>>>> snd_soc_sdw_utils snd_sof_probes iwlmvm mei_hdcp mei_pxp mac80211
>>>> snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic
>>>> snd_sof_pci soundwire_intel soundwire_generic_allocation soundwire_cadence
>>>> snd_sof_intel_hda_common snd_soc_hdac_hda iwlwifi btusb
>>>> snd_sof_intel_hda_mlink btintel snd_sof_intel_hda cfg80211 mei_me ov01a10 xe
>>>> drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec i915
>>>> drm_buddy intel_gtt drm_display_helper cec ttm
>>>> <4>[69848.706433][T25549] CR2: 0000000000000384
>>>> <4>[69848.706435][T25549] ---[ end trace 0000000000000000 ]
>>>> ----
>>>>
>>>> On 2/8/25 12:56, Kenneth Crudup wrote:
>>>>>
>>>>> Guys, I don't think this commit is right; I've had 2 out of three resume
>>>>> failures since this change went into Linus' master. I've attached a
>>>>> pstore dump of the latest crash, and while it appears to be coming from
>>>>> the Intel XE driver, 95% of my (s0ix) resumes worked previously[1]
>>>>> before this change.
>>>>>
>>>>> LMK if you need more information.
>>>>>
>>>>> -Kenny
>>>>>
>>>>> [1] - unless I forget to detach my NVMe USB4 external drive before
>>>>> suspending, which is a breakage that appears to have gone in sometime
>>>>> around the 6.10 series, but I haven't been able to bisect it
>>>>
>>>> -- 
>>>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
>>>> CA
>>>>
>>>
>>
>> -- 
>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
>> CA
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-11  6:17       ` diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)) Kenneth Crudup
@ 2025-02-13 13:59         ` Mika Westerberg
  2025-02-13 19:19           ` Kenneth Crudup
  0 siblings, 1 reply; 28+ messages in thread
From: Mika Westerberg @ 2025-02-13 13:59 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

Hi,

On Mon, Feb 10, 2025 at 10:17:47PM -0800, Kenneth Crudup wrote:
> 
> The setup is fairly simple (once I'd figured out the failure mode):
> 
> - Have an ASMedia 246x NVMe-to-USB4 housing (with NVMe drive) attached to
> the system via my TB4 dock (CalDigit TS4, but I've had it happen with a Dell
> dock as well), either with the drive mounted or not, when I suspend
> 
> - Resume with the drive disconnected (i.e., I've gone from home to the
> office).

I see, this is a fairly normal use case (sans the disk I guess). Steps to
follow are then something like:

1. Boot the system, nothing connected.
2. Connect CalDigit TS4 (PCIe tunnel is enabled by the UI) to the host Type-C port.
3. Connect ASMedia NVMe to CalDigit downstream Type-C port (PCIe tunnel is enabled by the UI).
4. Verify that the NVMe is visible (lspci, lsblk).

The topology looks like below:

  Host <- TB -> CalDigit TS4 <- TB -> NVMe

5. Suspend the system (close the lid).
6. Unplug the CalDigit TS4.
7. Resume the system (open the lid).

Expectation: system wakes up just fine.
Actual behavior: system crashes and burns.

BTW, do you unmount the filesystem before you suspend?

> It doesn't happen every time, and for some crazy reason elapsed time between
> suspend and resume seems to make it more likely to happen. Plus it seems
> directly attaching the drive (i.e., no dock in between) doesn't cause
> resumes to fail.

It would be good to see the dmesg output (with thunderbolt.dyndbg=+p) with
these connected, even without suspending, to see if there is anything
missing. Since it is a Dell system I would expect they have tested this in
Linux pretty well, so probably we won't see anything weird there.

I have a similar setup here (not the same devices though) so I can try on my
end to see if this reproduces.
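
Before collecting the log, it may be worth confirming that the running kernel
was actually booted with the debug parameter. A minimal sketch, assuming the
parameter is spelled exactly as above and appears unquoted in /proc/cmdline;
the helper name has_param is made up for illustration:

```python
def has_param(cmdline, param):
    """Return True if `param` appears as a whole token on the command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel on Linux.
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""
    enabled = has_param(cmdline, "thunderbolt.dyndbg=+p")
    print("thunderbolt dyndbg on command line:", enabled)
```

This only checks the boot command line; it does not notice dynamic debug
settings changed at runtime.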

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-13 13:59         ` Mika Westerberg
@ 2025-02-13 19:19           ` Kenneth Crudup
  2025-02-14 16:29             ` Mika Westerberg
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-13 19:19 UTC (permalink / raw)
  To: Mika Westerberg, Me
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

[-- Attachment #1: Type: text/plain, Size: 386 bytes --]


On 2/13/25 05:59, Mika Westerberg wrote:

> Hi,

As Murphy's would have it, now my crashes are display-driver related 
(this is Xe, but I've also seen it with i915).

Attached here just for the heck of it, but I'll be better testing the 
NVMe enclosure-related failures this weekend. Stay tuned!

-K

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA

[-- Attachment #2: pstore-202502131049.tar.gz --]
[-- Type: application/gzip, Size: 8000 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-13 19:19           ` Kenneth Crudup
@ 2025-02-14 16:29             ` Mika Westerberg
  2025-02-14 17:39               ` Kenneth Crudup
  0 siblings, 1 reply; 28+ messages in thread
From: Mika Westerberg @ 2025-02-14 16:29 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

Hi,

On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
> 
> On 2/13/25 05:59, Mika Westerberg wrote:
> 
> > Hi,
> 
> As Murphy's would have it, now my crashes are display-driver related (this
> is Xe, but I've also seen it with i915).
> 
> Attached here just for the heck of it, but I'll be better testing the NVMe
> enclosure-related failures this weekend. Stay tuned!

Okay, I checked quickly and there is no TB-related crash there, but I was
actually able to reproduce a hang when I unplug the device chain during suspend. I did
not yet have time to look into it deeper. I'm sure this has been working
fine in the past as we tested all kinds of topologies including similar to
this.

I will be out next week for vacation but will continue after that if the
problem is not already solved ;-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-14 16:29             ` Mika Westerberg
@ 2025-02-14 17:39               ` Kenneth Crudup
  2025-02-26  8:44                 ` Mika Westerberg
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-14 17:39 UTC (permalink / raw)
  To: Mika Westerberg, Me
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb


This is excellent news that you were able to reproduce it- I'd figured 
this regression would have been caught already (as I do remember this 
working before) and was worried it may have been specific to a 
particular piece of hardware (or software setup) on my system.

I'll see what I can dig up on my end, but as I'm not expert in these 
subsystems I may not be able to diagnose anything until your return.

I also saw some DRM/connected fixes posted to Linus' master so maybe one 
of them corrects this new display-crash issue (I'm not home on my big 
monitor to be able to test yet).

-Kenny

On 2/14/25 08:29, Mika Westerberg wrote:
> Hi,
> 
> On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
>>
>> On 2/13/25 05:59, Mika Westerberg wrote:
>>
>>> Hi,
>>
>> As Murphy's would have it, now my crashes are display-driver related (this
>> is Xe, but I've also seen it with i915).
>>
>> Attached here just for the heck of it, but I'll be better testing the NVMe
>> enclosure-related failures this weekend. Stay tuned!
> 
> Okay, I checked quickly and no TB related crash there but I was actually
> able to reproduce hang when I unplug the device chain during suspend. I did
> not yet have time to look into it deeper. I'm sure this has been working
> fine in the past as we tested all kinds of topologies including similar to
> this.
> 
> I will be out next week for vacation but will continue after that if the
> problem is not already solved ;-)
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-14 17:39               ` Kenneth Crudup
@ 2025-02-26  8:44                 ` Mika Westerberg
  2025-02-26  9:10                   ` Lukas Wunner
                                     ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Mika Westerberg @ 2025-02-26  8:44 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb

[-- Attachment #1: Type: text/plain, Size: 3075 bytes --]

Hi Kenneth,

On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:
> 
> This is excellent news that you were able to reproduce it- I'd figured this
> regression would have been caught already (as I do remember this working
> before) and was worried it may have been specific to a particular piece of
> hardware (or software setup) on my system.
> 
> I'll see what I can dig up on my end, but as I'm not expert in these
> subsystems I may not be able to diagnose anything until your return.

[Back now]

My git bisect ended up at this commit:

  9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")

Adding Lukas who is the expert.

My steps to reproduce on an Intel Meteor Lake based reference system are:

1. Boot the system up, nothing connected.
2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:

  [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]

3. Authorize PCIe tunnels (whatever your distro provides, my buildroot just
    has the debugging tools so running 'tbauth -r 301')

4. Check that the PCIe topology matches the expected (lspci)

5. Enter s2idle:

  # rtcwake -s 30 -mmem

6. Once it is suspended, unplug the cable between the host and the dock.

7. Wait for the resume to happen.

Expectation: The system wakes up fine, notices that the TB and PCIe devices
are gone, stays responsive and usable.

Actual result: Resume never completes.
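
For repeated runs, step 5 could be wrapped in a small helper. This is only a
sketch: the function name rtcwake_argv is hypothetical, and the script just
prints the command instead of suspending the machine. The -s (seconds) and -m
(mode) options are the same ones used in step 5.

```python
import shlex


def rtcwake_argv(seconds, mode="mem"):
    """Build the rtcwake command line used in step 5 (suspend, wake after N s)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return ["rtcwake", "-s", str(seconds), "-m", mode]


if __name__ == "__main__":
    # Print only; pass the list to subprocess.run() as root to really suspend.
    print(shlex.join(rtcwake_argv(30)))
```

The cable still has to be unplugged by hand while the system is suspended
(step 6), so the 30 second window may need to be longer in practice.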

I added "no_console_suspend" to the command line and then did sysrq-w to
get a list of blocked tasks. I've attached it just in case it is needed.
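
To skim a dump like the attached one, something along these lines may help. It
is a rough sketch that assumes the "task:... state:... pid:..." header and
"symbol+0xoff/0xlen" frame format seen in the attachment; the names
blocked_tasks and wait_site are made up for illustration. Frames prefixed with
"?" are skipped since the unwinder marks them as unreliable.

```python
import re

TASK_RE = re.compile(r"task:(\S+)\s+state:(\S)\s.*?\bpid:(\d+)")
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# Scheduler entry points that say nothing about what the task waits for.
SCHED = {"__schedule", "schedule", "schedule_preempt_disabled", "schedule_timeout"}


def blocked_tasks(log):
    """Return one dict per task header, with its reliable frames in order."""
    tasks = []
    current = None
    for line in log.splitlines():
        m = TASK_RE.search(line)
        if m:
            current = {"name": m.group(1), "state": m.group(2),
                       "pid": int(m.group(3)), "frames": []}
            tasks.append(current)
            continue
        m = FRAME_RE.match(line.strip())
        if m and current is not None and not m.group(1):
            current["frames"].append(m.group(2))
    return tasks


def wait_site(frames):
    """First reliable frame that is not a scheduler entry point, or None."""
    for name in frames:
        if name not in SCHED:
            return name
    return None


if __name__ == "__main__":
    import sys
    for t in blocked_tasks(sys.stdin.read()):
        print(t["pid"], t["name"], t["state"], wait_site(t["frames"]))
```

Feeding the attachment on stdin prints one line per blocked task with the
first function it is waiting in.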

If I revert the above commit the issue is gone. Now I'm not sure if this is
exactly the same issue that you are seeing but nevertheless this is kind of
normal use case so definitely something we should get fixed.

Lukas, if you need any more information let me know. I can reproduce this
easily.

> I also saw some DRM/connected fixes posted to Linus' master so maybe one of
> them corrects this new display-crash issue (I'm not home on my big monitor
> to be able to test yet).
> 
> -Kenny
> 
> On 2/14/25 08:29, Mika Westerberg wrote:
> > Hi,
> > 
> > On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
> > > 
> > > On 2/13/25 05:59, Mika Westerberg wrote:
> > > 
> > > > Hi,
> > > 
> > > As Murphy's would have it, now my crashes are display-driver related (this
> > > is Xe, but I've also seen it with i915).
> > > 
> > > Attached here just for the heck of it, but I'll be better testing the NVMe
> > > enclosure-related failures this weekend. Stay tuned!
> > 
> > Okay, I checked quickly and no TB related crash there but I was actually
> > able to reproduce hang when I unplug the device chain during suspend. I did
> > not yet have time to look into it deeper. I'm sure this has been working
> > fine in the past as we tested all kinds of topologies including similar to
> > this.
> > 
> > I will be out next week for vacation but will continue after that if the
> > problem is not already solved ;-)
> > 
> 
> -- 
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
> CA

[-- Attachment #2: 6.14-hang-nvme.out --]
[-- Type: text/plain, Size: 17094 bytes --]

[ 1371.331135] sysrq: Show Blocked State
[ 1371.334833] task:kworker/u56:0   state:D stack:0     pid:11    tgid:11    ppid:2      task_flags:0x4208060 flags:0x00004000
[ 1371.345878] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 1371.351243] Call Trace:
[ 1371.353684]  <TASK>
[ 1371.355780]  __schedule+0x1074/0x3000
[ 1371.359428]  ? __pfx___schedule+0x10/0x10
[ 1371.363415]  ? do_raw_spin_lock+0x12f/0x270
[ 1371.367575]  ? __kasan_check_write+0x14/0x20
[ 1371.371818]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1371.376316]  ? __kasan_check_read+0x11/0x20
[ 1371.380471]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1371.384711]  schedule+0x78/0x320
[ 1371.387924]  schedule_preempt_disabled+0x18/0x30
[ 1371.392509]  __mutex_lock.constprop.0+0x989/0x15d0
[ 1371.397270]  ? __pfx___mutex_lock.constprop.0+0x10/0x10
[ 1371.402454]  ? __kasan_check_write+0x14/0x20
[ 1371.406694]  ? __pfx_mutex_unlock+0x10/0x10
[ 1371.410850]  __mutex_lock_slowpath+0x13/0x20
[ 1371.415089]  mutex_lock+0xcd/0xe0
[ 1371.418383]  ? __pfx_mutex_lock+0x10/0x10
[ 1371.422366]  ? __pfx___flush_workqueue+0x10/0x10
[ 1371.426950]  acpi_device_hotplug+0x85/0xa20
[ 1371.431110]  ? __pfx_acpi_device_hotplug+0x10/0x10
[ 1371.435870]  acpi_hotplug_work_fn+0x5e/0x90
[ 1371.440026]  process_one_work+0x640/0xeb0
[ 1371.444006]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1371.448503]  ? __pfx___timer_delete_sync+0x10/0x10
[ 1371.453261]  worker_thread+0x5ec/0x1050
[ 1371.457078]  ? __pfx_worker_thread+0x10/0x10
[ 1371.461319]  kthread+0x384/0x7e0
[ 1371.464528]  ? __pfx_kthread+0x10/0x10
[ 1371.468253]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1371.472581]  ? calculate_sigpending+0x77/0xa0
[ 1371.476909]  ? __pfx_kthread+0x10/0x10
[ 1371.480632]  ret_from_fork+0x3a/0x80
[ 1371.484188]  ? __pfx_kthread+0x10/0x10
[ 1371.487912]  ret_from_fork_asm+0x1a/0x30
[ 1371.491811]  </TASK>
[ 1371.494089] task:irq/123-pciehp  state:D stack:0     pid:140   tgid:140   ppid:2      task_flags:0x288040 flags:0x00004000
[ 1371.505034] Call Trace:
[ 1371.507470]  <TASK>
[ 1371.509563]  __schedule+0x1074/0x3000
[ 1371.513204]  ? __pfx___schedule+0x10/0x10
[ 1371.517187]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1371.521687]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1371.526612]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1371.530853]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1371.535780]  schedule+0x78/0x320
[ 1371.538988]  __synchronize_irq+0x160/0x1d0
[ 1371.543059]  ? __pfx___synchronize_irq+0x10/0x10
[ 1371.547645]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 1371.552832]  free_irq+0x2db/0x860
[ 1371.556127]  pcie_shutdown_notification+0xed/0x1a0
[ 1371.560887]  pciehp_remove+0x45/0xa0
[ 1371.564444]  pcie_port_remove_service+0x6a/0xa0
[ 1371.568944]  device_remove+0x118/0x170
[ 1371.572671]  device_release_driver_internal+0x3c1/0x570
[ 1371.577857]  ? klist_devices_put+0x31/0x50
[ 1371.581925]  device_release_driver+0x12/0x20
[ 1371.586165]  bus_remove_device+0x1e8/0x3d0
[ 1371.590233]  device_del+0x398/0x960
[ 1371.593701]  ? __pfx_device_del+0x10/0x10
[ 1371.597684]  ? __kasan_check_read+0x11/0x20
[ 1371.601839]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1371.606081]  ? __pfx_remove_iter+0x10/0x10
[ 1371.610152]  device_unregister+0x17/0xa0
[ 1371.614048]  remove_iter+0x46/0x60
[ 1371.617430]  device_for_each_child+0xe5/0x170
[ 1371.621761]  ? __pfx_pciehp_is_native+0x10/0x10
[ 1371.626263]  ? __pfx_device_for_each_child+0x10/0x10
[ 1371.631193]  ? __kasan_check_read+0x11/0x20
[ 1371.635350]  pcie_portdrv_remove+0x30/0x80
[ 1371.639418]  pci_device_remove+0xa9/0x1d0
[ 1371.643402]  device_remove+0xc4/0x170
[ 1371.647044]  device_release_driver_internal+0x3c1/0x570
[ 1371.652228]  device_release_driver+0x12/0x20
[ 1371.656469]  pci_stop_bus_device+0x102/0x150
[ 1371.660709]  pci_stop_bus_device+0xa2/0x150
[ 1371.664864]  pci_stop_bus_device+0xa2/0x150
[ 1371.669018]  pci_stop_bus_device+0xca/0x150
[ 1371.673173]  pci_stop_and_remove_bus_device+0x12/0x30
[ 1371.678187]  pciehp_unconfigure_device+0x231/0x360
[ 1371.682947]  ? __pfx_pciehp_unconfigure_device+0x10/0x10
[ 1371.688221]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1371.692459]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1371.697385]  pciehp_disable_slot+0xfc/0x2f0
[ 1371.701540]  ? __pfx_pciehp_disable_slot+0x10/0x10
[ 1371.706300]  ? __pfx_mutex_unlock+0x10/0x10
[ 1371.710455]  ? mutex_lock_interruptible+0xc0/0xe0
[ 1371.715127]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1371.720055]  pciehp_handle_presence_or_link_change+0x114/0xa00
[ 1371.725842]  ? down_read+0x110/0x480
[ 1371.729395]  ? __pfx___synchronize_hardirq+0x10/0x10
[ 1371.734323]  ? __pfx_pciehp_handle_presence_or_link_change+0x10/0x10
[ 1371.740625]  pciehp_ist+0x23b/0x380
[ 1371.744090]  ? __pfx_pciehp_ist+0x10/0x10
[ 1371.748071]  irq_thread_fn+0x89/0x160
[ 1371.751714]  irq_thread+0x33b/0x580
[ 1371.755180]  ? __pfx_irq_thread_fn+0x10/0x10
[ 1371.759420]  ? __pfx_irq_thread+0x10/0x10
[ 1371.763402]  ? __pfx_irq_thread_dtor+0x10/0x10
[ 1371.767812]  ? __kasan_check_read+0x11/0x20
[ 1371.771967]  ? __kthread_parkme+0x8f/0x160
[ 1371.776036]  ? __pfx_irq_thread+0x10/0x10
[ 1371.780020]  kthread+0x384/0x7e0
[ 1371.783227]  ? __pfx_kthread+0x10/0x10
[ 1371.786951]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1371.791277]  ? calculate_sigpending+0x77/0xa0
[ 1371.795606]  ? __pfx_kthread+0x10/0x10
[ 1371.799327]  ret_from_fork+0x3a/0x80
[ 1371.802882]  ? __pfx_kthread+0x10/0x10
[ 1371.806607]  ret_from_fork_asm+0x1a/0x30
[ 1371.810503]  </TASK>
[ 1371.812759] task:irq/217-pciehp  state:D stack:0     pid:548   tgid:548   ppid:2      task_flags:0x208040 flags:0x00004000
[ 1371.823705] Call Trace:
[ 1371.826142]  <TASK>
[ 1371.828232]  __schedule+0x1074/0x3000
[ 1371.831876]  ? __pfx___schedule+0x10/0x10
[ 1371.835859]  ? __kasan_check_write+0x14/0x20
[ 1371.840102]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1371.844602]  ? __pfx_osq_unlock+0x10/0x10
[ 1371.848587]  schedule+0x78/0x320
[ 1371.851794]  schedule_preempt_disabled+0x18/0x30
[ 1371.856381]  __mutex_lock.constprop.0+0x989/0x15d0
[ 1371.861137]  ? __pfx___mutex_lock.constprop.0+0x10/0x10
[ 1371.866323]  ? __pfx___dynamic_dev_dbg+0x10/0x10
[ 1371.870908]  ? up_read+0x215/0x7b0
[ 1371.874292]  __mutex_lock_slowpath+0x13/0x20
[ 1371.878532]  mutex_lock+0xcd/0xe0
[ 1371.881827]  ? __pfx_mutex_lock+0x10/0x10
[ 1371.885813]  ? __pfx_pci_dev_set_disconnected+0x10/0x10
[ 1371.891000]  pci_lock_rescan_remove+0x15/0x20
[ 1371.895332]  pciehp_unconfigure_device+0x185/0x360
[ 1371.900087]  ? __pfx_pciehp_unconfigure_device+0x10/0x10
[ 1371.905359]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1371.909598]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1371.914527]  pciehp_disable_slot+0xfc/0x2f0
[ 1371.918683]  ? __pfx_pciehp_disable_slot+0x10/0x10
[ 1371.923441]  ? __pfx_mutex_unlock+0x10/0x10
[ 1371.927597]  ? mutex_lock_interruptible+0xc0/0xe0
[ 1371.932265]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1371.937194]  pciehp_handle_presence_or_link_change+0x114/0xa00
[ 1371.942979]  ? down_read+0x110/0x480
[ 1371.946532]  ? __pfx___synchronize_hardirq+0x10/0x10
[ 1371.951462]  ? __pfx_pciehp_handle_presence_or_link_change+0x10/0x10
[ 1371.957763]  pciehp_ist+0x23b/0x380
[ 1371.961230]  ? kfree+0x106/0x3b0
[ 1371.964438]  ? __pfx_pciehp_ist+0x10/0x10
[ 1371.968420]  irq_thread_fn+0x89/0x160
[ 1371.972062]  irq_thread+0x33b/0x580
[ 1371.975532]  ? __pfx_irq_thread_fn+0x10/0x10
[ 1371.979770]  ? __pfx_irq_thread+0x10/0x10
[ 1371.983752]  ? __pfx_irq_thread_dtor+0x10/0x10
[ 1371.988162]  ? __kasan_check_read+0x11/0x20
[ 1371.992318]  ? __kthread_parkme+0x8f/0x160
[ 1371.996385]  ? __pfx_irq_thread+0x10/0x10
[ 1372.000367]  kthread+0x384/0x7e0
[ 1372.003575]  ? __pfx_kthread+0x10/0x10
[ 1372.007304]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1372.011630]  ? calculate_sigpending+0x77/0xa0
[ 1372.015954]  ? __pfx_kthread+0x10/0x10
[ 1372.019677]  ret_from_fork+0x3a/0x80
[ 1372.023230]  ? __pfx_kthread+0x10/0x10
[ 1372.026954]  ret_from_fork_asm+0x1a/0x30
[ 1372.030850]  </TASK>
[ 1372.033030] task:irq/221-pciehp  state:D stack:0     pid:551   tgid:551   ppid:2      task_flags:0x208040 flags:0x00004000
[ 1372.043974] Call Trace:
[ 1372.046411]  <TASK>
[ 1372.048503]  __schedule+0x1074/0x3000
[ 1372.052142]  ? __pfx___schedule+0x10/0x10
[ 1372.056125]  ? __kasan_check_write+0x14/0x20
[ 1372.060366]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.064865]  ? dev_printk_emit+0xa2/0xd5
[ 1372.068765]  schedule+0x78/0x320
[ 1372.071972]  schedule_preempt_disabled+0x18/0x30
[ 1372.076560]  __mutex_lock.constprop.0+0x989/0x15d0
[ 1372.081320]  ? __pfx___mutex_lock.constprop.0+0x10/0x10
[ 1372.086505]  ? __pfx___dynamic_dev_dbg+0x10/0x10
[ 1372.091087]  ? up_read+0x215/0x7b0
[ 1372.094470]  __mutex_lock_slowpath+0x13/0x20
[ 1372.098710]  mutex_lock+0xcd/0xe0
[ 1372.102006]  ? __pfx_mutex_lock+0x10/0x10
[ 1372.105989]  ? __pfx_pci_dev_set_disconnected+0x10/0x10
[ 1372.111175]  pci_lock_rescan_remove+0x15/0x20
[ 1372.115504]  pciehp_unconfigure_device+0x185/0x360
[ 1372.120262]  ? __pfx_pciehp_unconfigure_device+0x10/0x10
[ 1372.125531]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1372.129773]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1372.134700]  pciehp_disable_slot+0xfc/0x2f0
[ 1372.138856]  ? __pfx_pciehp_disable_slot+0x10/0x10
[ 1372.143614]  ? __pfx_mutex_unlock+0x10/0x10
[ 1372.147767]  ? mutex_lock_interruptible+0xc0/0xe0
[ 1372.152434]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1372.157363]  pciehp_handle_presence_or_link_change+0x114/0xa00
[ 1372.163152]  ? down_read+0x110/0x480
[ 1372.166703]  ? __pfx___synchronize_hardirq+0x10/0x10
[ 1372.171631]  ? __pfx_pciehp_handle_presence_or_link_change+0x10/0x10
[ 1372.177934]  pciehp_ist+0x23b/0x380
[ 1372.181400]  ? kfree+0x106/0x3b0
[ 1372.184608]  ? __pfx_pciehp_ist+0x10/0x10
[ 1372.188590]  irq_thread_fn+0x89/0x160
[ 1372.192232]  irq_thread+0x33b/0x580
[ 1372.195697]  ? __pfx_irq_thread_fn+0x10/0x10
[ 1372.199938]  ? __pfx_irq_thread+0x10/0x10
[ 1372.203922]  ? __pfx_irq_thread_dtor+0x10/0x10
[ 1372.208332]  ? __kasan_check_read+0x11/0x20
[ 1372.212486]  ? __kthread_parkme+0x8f/0x160
[ 1372.216553]  ? __pfx_irq_thread+0x10/0x10
[ 1372.220536]  kthread+0x384/0x7e0
[ 1372.223744]  ? __pfx_kthread+0x10/0x10
[ 1372.227468]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1372.231798]  ? calculate_sigpending+0x77/0xa0
[ 1372.236128]  ? __pfx_kthread+0x10/0x10
[ 1372.239851]  ret_from_fork+0x3a/0x80
[ 1372.243404]  ? __pfx_kthread+0x10/0x10
[ 1372.247129]  ret_from_fork_asm+0x1a/0x30
[ 1372.251031]  </TASK>
[ 1372.253210] task:rtcwake         state:D stack:0     pid:570   tgid:570   ppid:375    task_flags:0x80400000 flags:0x00004002
[ 1372.264324] Call Trace:
[ 1372.266764]  <TASK>
[ 1372.268855]  __schedule+0x1074/0x3000
[ 1372.272493]  ? __pfx___schedule+0x10/0x10
[ 1372.276475]  ? __kasan_check_write+0x14/0x20
[ 1372.280716]  ? do_raw_spin_lock+0x12f/0x270
[ 1372.284871]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.289371]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1372.293611]  ? __kasan_check_read+0x11/0x20
[ 1372.297766]  schedule+0x78/0x320
[ 1372.300979]  async_synchronize_cookie_domain+0x1af/0x210
[ 1372.306252]  ? __pfx_async_synchronize_cookie_domain+0x10/0x10
[ 1372.312038]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[ 1372.316967]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 1372.322154]  ? mutex_unlock+0x83/0xd0
[ 1372.325796]  ? __pfx_mutex_unlock+0x10/0x10
[ 1372.329954]  ? __pfx_mutex_lock+0x10/0x10
[ 1372.333938]  async_synchronize_full+0x17/0x20
[ 1372.338268]  dpm_resume+0x256/0x5c0
[ 1372.341733]  ? __pfx_dpm_resume+0x10/0x10
[ 1372.345716]  ? __pfx_dpm_resume_early+0x10/0x10
[ 1372.350219]  ? acpi_set_gpe_wake_mask+0x172/0x250
[ 1372.354888]  dpm_resume_end+0x11/0x30
[ 1372.358529]  suspend_devices_and_enter+0x371/0x11a0
[ 1372.363379]  ? __pfx_suspend_devices_and_enter+0x10/0x10
[ 1372.368653]  pm_suspend+0x2a7/0xaa0
[ 1372.372117]  state_store+0xaa/0x150
[ 1372.375584]  ? __pfx_sysfs_kf_write+0x10/0x10
[ 1372.379918]  kobj_attr_store+0x36/0x70
[ 1372.383643]  ? __pfx_mutex_lock+0x10/0x10
[ 1372.387626]  ? __pfx_kobj_attr_store+0x10/0x10
[ 1372.392042]  sysfs_kf_write+0x122/0x1c0
[ 1372.395856]  ? __kasan_check_write+0x14/0x20
[ 1372.400098]  kernfs_fop_write_iter+0x321/0x4d0
[ 1372.404517]  vfs_write+0x5be/0xf40
[ 1372.407903]  ? __pfx_vfs_write+0x10/0x10
[ 1372.411803]  ? do_sys_openat2+0x115/0x170
[ 1372.415789]  ? __kasan_check_read+0x11/0x20
[ 1372.419944]  ? fdget_pos+0x1be/0x4c0
[ 1372.423497]  ? __rseq_handle_notify_resume+0x49b/0xaf0
[ 1372.428600]  ksys_write+0x106/0x200
[ 1372.432064]  ? __pfx_ksys_write+0x10/0x10
[ 1372.436046]  ? __pfx___x64_sys_openat+0x10/0x10
[ 1372.440545]  __x64_sys_write+0x72/0xb0
[ 1372.444269]  ? syscall_exit_to_user_mode+0x54/0x190
[ 1372.449109]  x64_sys_call+0x28f/0x1d70
[ 1372.452832]  do_syscall_64+0x4b/0x110
[ 1372.456475]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1372.461496] RIP: 0033:0x7fe7c939b25e
[ 1372.465051] RSP: 002b:00007ffde0114988 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 1372.472560] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe7c939b25e
[ 1372.479634] RDX: 0000000000000003 RSI: 00007ffde0116e90 RDI: 0000000000000004
[ 1372.486708] RBP: 00007ffde0116e90 R08: 0000000000000000 R09: 0000000000000000
[ 1372.493784] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[ 1372.500858] R13: 00007fe7c92d86c8 R14: 00005647ef246e7b R15: 0000000000000000
[ 1372.507934]  </TASK>
[ 1372.510116] task:kworker/u57:6   state:D stack:0     pid:575   tgid:575   ppid:2      task_flags:0x4208060 flags:0x00004000
[ 1372.521146] Workqueue: async async_run_entry_fn
[ 1372.525646] Call Trace:
[ 1372.528087]  <TASK>
[ 1372.530178]  __schedule+0x1074/0x3000
[ 1372.533821]  ? __pfx___schedule+0x10/0x10
[ 1372.537803]  ? __kasan_check_write+0x14/0x20
[ 1372.542044]  ? do_raw_spin_lock+0x12f/0x270
[ 1372.546199]  ? timerqueue_add+0x160/0x340
[ 1372.550186]  ? __kasan_check_read+0x11/0x20
[ 1372.554339]  schedule+0x78/0x320
[ 1372.557552]  schedule_timeout+0x16c/0x1f0
[ 1372.561534]  ? __pfx_schedule_timeout+0x10/0x10
[ 1372.566034]  ? do_raw_spin_lock+0x12f/0x270
[ 1372.570190]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.574689]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.579189]  ? __kasan_check_read+0x11/0x20
[ 1372.583346]  __wait_for_common+0x344/0x550
[ 1372.587414]  ? __pfx_schedule_timeout+0x10/0x10
[ 1372.591913]  ? __pfx___wait_for_common+0x10/0x10
[ 1372.596500]  ? mutex_unlock+0x83/0xd0
[ 1372.600142]  ? __pfx_mutex_unlock+0x10/0x10
[ 1372.604299]  ? __pfx_mutex_lock+0x10/0x10
[ 1372.608283]  wait_for_completion+0x24/0x30
[ 1372.612350]  dpm_wait_for_superior+0x301/0x430
[ 1372.616762]  device_resume+0xd6/0x7d0
[ 1372.620404]  async_resume+0x1d/0x30
[ 1372.623870]  async_run_entry_fn+0x95/0x520
[ 1372.627938]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1372.632180]  process_one_work+0x640/0xeb0
[ 1372.636162]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.640659]  ? __pfx___timer_delete_sync+0x10/0x10
[ 1372.645418]  worker_thread+0x5ec/0x1050
[ 1372.649232]  ? __pfx_worker_thread+0x10/0x10
[ 1372.653471]  kthread+0x384/0x7e0
[ 1372.656677]  ? __pfx_kthread+0x10/0x10
[ 1372.660401]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1372.664730]  ? calculate_sigpending+0x77/0xa0
[ 1372.669058]  ? __pfx_kthread+0x10/0x10
[ 1372.672781]  ret_from_fork+0x3a/0x80
[ 1372.676336]  ? __pfx_kthread+0x10/0x10
[ 1372.680061]  ret_from_fork_asm+0x1a/0x30
[ 1372.683962]  </TASK>
[ 1372.686152] task:kworker/u57:52  state:D stack:0     pid:630   tgid:630   ppid:2      task_flags:0x4208060 flags:0x00004000
[ 1372.697180] Workqueue: async async_run_entry_fn
[ 1372.701679] Call Trace:
[ 1372.704120]  <TASK>
[ 1372.706211]  __schedule+0x1074/0x3000
[ 1372.709854]  ? __pfx___schedule+0x10/0x10
[ 1372.713835]  ? __kasan_check_write+0x14/0x20
[ 1372.718077]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.722584]  ? schedule+0x232/0x320
[ 1372.726050]  schedule+0x78/0x320
[ 1372.729263]  schedule_preempt_disabled+0x18/0x30
[ 1372.733848]  __mutex_lock.constprop.0+0x989/0x15d0
[ 1372.738605]  ? __pfx___mutex_lock.constprop.0+0x10/0x10
[ 1372.743789]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1372.748118]  ? __pfx_schedule_timeout+0x10/0x10
[ 1372.752616]  ? __pfx___wait_for_common+0x10/0x10
[ 1372.757198]  ? mutex_unlock+0x83/0xd0
[ 1372.760840]  __mutex_lock_slowpath+0x13/0x20
[ 1372.765079]  mutex_lock+0xcd/0xe0
[ 1372.768375]  ? __pfx_mutex_lock+0x10/0x10
[ 1372.772360]  device_resume+0x1d7/0x7d0
[ 1372.776084]  async_resume+0x1d/0x30
[ 1372.779551]  async_run_entry_fn+0x95/0x520
[ 1372.783619]  ? do_raw_spin_unlock+0x59/0x1f0
[ 1372.787859]  process_one_work+0x640/0xeb0
[ 1372.791843]  ? __pfx_do_raw_spin_lock+0x10/0x10
[ 1372.796341]  ? __pfx___timer_delete_sync+0x10/0x10
[ 1372.801096]  worker_thread+0x5ec/0x1050
[ 1372.804908]  ? __pfx_worker_thread+0x10/0x10
[ 1372.809150]  kthread+0x384/0x7e0
[ 1372.812358]  ? __pfx_kthread+0x10/0x10
[ 1372.816081]  ? _raw_spin_unlock_irq+0x1e/0x40
[ 1372.820408]  ? calculate_sigpending+0x77/0xa0
[ 1372.824738]  ? __pfx_kthread+0x10/0x10
[ 1372.828461]  ret_from_fork+0x3a/0x80
[ 1372.832016]  ? __pfx_kthread+0x10/0x10
[ 1372.835742]  ret_from_fork_asm+0x1a/0x30
[ 1372.839639]  </TASK>
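
A dump like the one above is easier to compare across tasks once the
speculative frames (the ones prefixed with "?") are dropped. The following
is a minimal Python sketch of that, not part of the original report. The
names parse_blocked_tasks, wait_site and WAIT_PRIMITIVES are hypothetical
helpers chosen for illustration, and the WAIT_PRIMITIVES set is only a guess
at which frames count as generic sleep/lock primitives; it would need
adjusting for other traces. Task names containing spaces and other log
prefixes (such as "<6>[...][T...]") are not handled.

```python
import re

# Matches the optional "[ 1372.253210]" printk timestamp at the start of a line.
STAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]")
# Matches the per-task header printed by sysrq-w, e.g. "task:rtcwake state:D ... pid:570".
TASK_RE = re.compile(r"task:(\S+)\s+state:(\S)\s.*?\bpid:(\d+)")
# Matches a stack frame such as "dpm_resume+0x256/0x5c0".
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Assumption: frames treated as generic sleep/lock primitives, not callers of interest.
WAIT_PRIMITIVES = {
    "__schedule", "schedule", "schedule_timeout", "schedule_preempt_disabled",
    "__wait_for_common", "wait_for_completion",
    "__mutex_lock.constprop.0", "__mutex_lock_slowpath", "mutex_lock",
}


def parse_blocked_tasks(log_text):
    """Return one dict per task with its reliable (non-'?') frames, innermost first."""
    tasks = []
    current = None
    for raw in log_text.splitlines():
        line = STAMP_RE.sub("", raw).strip()
        header = TASK_RE.match(line)
        if header:
            current = {"comm": header.group(1), "state": header.group(2),
                       "pid": int(header.group(3)), "frames": []}
            tasks.append(current)
            continue
        if current is None or line.startswith("? "):
            continue  # skip text before the first task and speculative frames
        frame = FRAME_RE.match(line)
        if frame:
            current["frames"].append(frame.group(1))
    return tasks


def wait_site(frames):
    """Return (last wait primitive seen, first frame above it); either may be None."""
    primitive = None
    for name in frames:
        if name in WAIT_PRIMITIVES:
            primitive = name
        else:
            return primitive, name
    return primitive, None


# Excerpt of the kworker/u57:52 entry from the dump above.
SAMPLE = """\
[ 1372.686152] task:kworker/u57:52  state:D stack:0     pid:630   tgid:630   ppid:2      task_flags:0x4208060 flags:0x00004000
[ 1372.697180] Workqueue: async async_run_entry_fn
[ 1372.701679] Call Trace:
[ 1372.704120]  <TASK>
[ 1372.706211]  __schedule+0x1074/0x3000
[ 1372.709854]  ? __pfx___schedule+0x10/0x10
[ 1372.726050]  schedule+0x78/0x320
[ 1372.729263]  schedule_preempt_disabled+0x18/0x30
[ 1372.733848]  __mutex_lock.constprop.0+0x989/0x15d0
[ 1372.760840]  __mutex_lock_slowpath+0x13/0x20
[ 1372.765079]  mutex_lock+0xcd/0xe0
[ 1372.772360]  device_resume+0x1d7/0x7d0
[ 1372.776084]  async_resume+0x1d/0x30
"""

for task in parse_blocked_tasks(SAMPLE):
    print(task["comm"], task["pid"], wait_site(task["frames"]))
```

For the excerpt used here, the sketch reports mutex_lock as the primitive
and device_resume as the caller, which is the same reading one gets from
the trace by eye.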

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26  8:44                 ` Mika Westerberg
@ 2025-02-26  9:10                   ` Lukas Wunner
  2025-02-26  9:19                     ` Mika Westerberg
  2025-02-26 15:31                   ` Kenneth Crudup
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Lukas Wunner @ 2025-02-26  9:10 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Kenneth Crudup, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Wed, Feb 26, 2025 at 10:44:04AM +0200, Mika Westerberg wrote:
>   [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
[...]
> I added "no_console_suspend" to the command line and then did sysrq-w to
> get a list of blocked tasks. I've attached it just in case it is needed.
[...]

This looks like the deadlock we've had for years when hot-removing
nested hotplug ports.

If you attach only a single device to the host, I guess the issue
does not occur, right?

Previous attempts to fix this:

https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/

https://lore.kernel.org/all/20240612181625.3604512-1-kbusch@meta.com/

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26  9:10                   ` Lukas Wunner
@ 2025-02-26  9:19                     ` Mika Westerberg
  2025-03-03 20:00                       ` Lukas Wunner
  0 siblings, 1 reply; 28+ messages in thread
From: Mika Westerberg @ 2025-02-26  9:19 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kenneth Crudup, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Wed, Feb 26, 2025 at 10:10:43AM +0100, Lukas Wunner wrote:
> On Wed, Feb 26, 2025 at 10:44:04AM +0200, Mika Westerberg wrote:
> >   [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
> [...]
> > I added "no_console_suspend" to the command line and then did sysrq-w to
> > get a list of blocked tasks. I've attached it just in case it is needed.
> [...]
> 
> This looks like the deadlock we've had for years when hot-removing
> nested hotplug ports.
> 
> If you attach only a single device to the host, I guess the issue
> does not occur, right?

Yes.

> Previous attempts to fix this:
> 
> https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
> 
> https://lore.kernel.org/all/20240612181625.3604512-1-kbusch@meta.com/

Well, it does not happen if I revert the commit, so isn't that a
regression?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26  8:44                 ` Mika Westerberg
  2025-02-26  9:10                   ` Lukas Wunner
@ 2025-02-26 15:31                   ` Kenneth Crudup
  2025-02-26 21:13                   ` Kenneth Crudup
  2025-02-26 21:14                   ` Kenneth Crudup
  3 siblings, 0 replies; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-26 15:31 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb, Kenneth Crudup


Trying to do a "control" test before I try out your bisected commit, and 
Lukas' changes, but of course now I can't get it to fail (I'm on Linus' 
master as of this morning (b5799106b4)).

I'm using my portable USB4 dock (Plugable TBT4-HUB3C) this time (vs. my 
CalDigit 4 dock) but the same ASMedia USB4-to-NVMe adapter as always; in 
any case everything is PCIe so it shouldn't matter.

I don't normally use "tbauth" (I think that's all done for me via the 
"boltctl" suite) but I grabbed and built the GIT and ran it anyway, for 
good measure.

I'll keep you updated, I'll be at my CalDigit dock soon enough if I 
can't get any failures this morning.

-K

On 2/26/25 00:44, Mika Westerberg wrote:
> Hi Kenneth,
> 
> On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:
>>
>> This is excellent news that you were able to reproduce it- I'd figured this
>> regression would have been caught already (as I do remember this working
>> before) and was worried it may have been specific to a particular piece of
>> hardware (or software setup) on my system.
>>
>> I'll see what I can dig up on my end, but as I'm not expert in these
>> subsystems I may not be able to diagnose anything until your return.
> 
> [Back now]
> 
> My git bisect ended up to this commit:
> 
>    9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")
> 
> Adding Lukas who is the expert.
> 
> My steps to reproduce on Intel Meteor Lake based reference system are:
> 
> 1. Boot the system up, nothing connected.
> 2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:
> 
>    [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
> 
> 3. Authorize PCIe tunnels (whatever your distro provides, my buildroot just
>      has the debugging tools so running 'tbauth -r 301')
> 
> 4. Check that the PCIe topology matches the expected (lspci)
> 
> 5. Enter s2idle:
> 
>    # rtcwake -s 30 -mmem
> 
> 6. Once it is suspended, unplug the cable between the host and the dock.
> 
> 7. Wait for the resume to happen.
> 
> Expectation: The system wakes up fine, notices that the TB and PCIe devices
> are gone, stays responsive and usable.
> 
> Actual result: Resume never completes.
> 
> I added "no_console_suspend" to the command line and then did sysrq-w to
> get a list of blocked tasks. I've attached it just in case it is needed.
> 
> If I revert the above commit the issue is gone. Now I'm not sure if this is
> exactly the same issue that you are seeing but nevertheless this is kind of
> normal use case so definitely something we should get fixed.
> 
> Lukas, if you need any more information let me know. I can reproduce this
> easily.
> 
>> I also saw some DRM/connected fixes posted to Linus' master so maybe one of
>> them corrects this new display-crash issue (I'm not home on my big monitor
>> to be able to test yet).
>>
>> -Kenny
>>
>> On 2/14/25 08:29, Mika Westerberg wrote:
>>> Hi,
>>>
>>> On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
>>>>
>>>> On 2/13/25 05:59, Mika Westerberg wrote:
>>>>
>>>>> Hi,
>>>>
>>>> As Murphy's would have it, now my crashes are display-driver related (this
>>>> is Xe, but I've also seen it with i915).
>>>>
>>>> Attached here just for the heck of it, but I'll be better testing the NVMe
>>>> enclosure-related failures this weekend. Stay tuned!
>>>
>>> Okay, I checked quickly and no TB related crash there but I was actually
>>> able to reproduce hang when I unplug the device chain during suspend. I did
>>> not yet have time to look into it deeper. I'm sure this has been working
>>> fine in the past as we tested all kinds of topologies including similar to
>>> this.
>>>
>>> I will be out next week for vacation but will continue after that if the
>>> problem is not already solved ;-)
>>>
>>
>> -- 
>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
>> CA

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26  8:44                 ` Mika Westerberg
  2025-02-26  9:10                   ` Lukas Wunner
  2025-02-26 15:31                   ` Kenneth Crudup
@ 2025-02-26 21:13                   ` Kenneth Crudup
  2025-02-26 21:14                   ` Kenneth Crudup
  3 siblings, 0 replies; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-26 21:13 UTC (permalink / raw)
  To: Mika Westerberg, Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb

Trying to do a "control" test before I try out your bisected commit, and
Lukas' changes, but of course now I can't get it to fail (I'm on Linus'
master as of this morning (b5799106b4)).

I'm using my portable USB4 dock (Plugable TBT4-HUB3C) this time (vs. my
CalDigit 4 dock) but the same ASMedia USB4-to-NVMe adapter as always; in
any case everything is PCIe so it shouldn't matter.

I don't normally use "tbauth" (I think that's all done for me via the
"boltctl" suite) but I grabbed and built the GIT and ran it anyway, for
good measure.

I'll keep you updated, I'll be at my CalDigit dock soon enough if I
can't get any failures this morning.

-K


On 2/26/25 00:44, Mika Westerberg wrote:
> Hi Kenneth,
> 
> On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:
>>
>> This is excellent news that you were able to reproduce it- I'd figured this
>> regression would have been caught already (as I do remember this working
>> before) and was worried it may have been specific to a particular piece of
>> hardware (or software setup) on my system.
>>
>> I'll see what I can dig up on my end, but as I'm not expert in these
>> subsystems I may not be able to diagnose anything until your return.
> 
> [Back now]
> 
> My git bisect ended up to this commit:
> 
>    9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")
> 
> Adding Lukas who is the expert.
> 
> My steps to reproduce on Intel Meteor Lake based reference system are:
> 
> 1. Boot the system up, nothing connected.
> 2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:
> 
>    [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
> 
> 3. Authorize PCIe tunnels (whatever your distro provides, my buildroot just
>      has the debugging tools so running 'tbauth -r 301')
> 
> 4. Check that the PCIe topology matches the expected (lspci)
> 
> 5. Enter s2idle:
> 
>    # rtcwake -s 30 -mmem
> 
> 6. Once it is suspended, unplug the cable between the host and the dock.
> 
> 7. Wait for the resume to happen.
> 
> Expectation: The system wakes up fine, notices that the TB and PCIe devices
> are gone, stays responsive and usable.
> 
> Actual result: Resume never completes.
> 
> I added "no_console_suspend" to the command line and then did sysrq-w to
> get a list of blocked tasks. I've attached it just in case it is needed.
> 
> If I revert the above commit the issue is gone. Now I'm not sure if this is
> exactly the same issue that you are seeing but nevertheless this is kind of
> normal use case so definitely something we should get fixed.
> 
> Lukas, if you need any more information let me know. I can reproduce this
> easily.
> 
>> I also saw some DRM/connected fixes posted to Linus' master so maybe one of
>> them corrects this new display-crash issue (I'm not home on my big monitor
>> to be able to test yet).
>>
>> -Kenny
>>
>> On 2/14/25 08:29, Mika Westerberg wrote:
>>> Hi,
>>>
>>> On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
>>>>
>>>> On 2/13/25 05:59, Mika Westerberg wrote:
>>>>
>>>>> Hi,
>>>>
>>>> As Murphy's would have it, now my crashes are display-driver related (this
>>>> is Xe, but I've also seen it with i915).
>>>>
>>>> Attached here just for the heck of it, but I'll be better testing the NVMe
>>>> enclosure-related failures this weekend. Stay tuned!
>>>
>>> Okay, I checked quickly and no TB related crash there but I was actually
>>> able to reproduce hang when I unplug the device chain during suspend. I did
>>> not yet have time to look into it deeper. I'm sure this has been working
>>> fine in the past as we tested all kinds of topologies including similar to
>>> this.
>>>
>>> I will be out next week for vacation but will continue after that if the
>>> problem is not already solved ;-)
>>>
>>
>> -- 
>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
>> CA

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26  8:44                 ` Mika Westerberg
                                     ` (2 preceding siblings ...)
  2025-02-26 21:13                   ` Kenneth Crudup
@ 2025-02-26 21:14                   ` Kenneth Crudup
  2025-02-27 17:46                     ` Kenneth Crudup
  3 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-26 21:14 UTC (permalink / raw)
  To: Mika Westerberg, Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb

OK, I just resumed after being suspended (for an hour, which somehow seems
to matter). My CalDigit dock was attached with the ASMedia NVMe adaptor
at suspend, but both were disconnected on resume, and I am indeed
locked up.

I can attach the "pstore" report if necessary.

Unfortunately I won't be able to get back to the CalDigit until Saturday 
afternoon California time.

I'll be trying all the reverts/commits listed herein and at least check 
for regressions in other cases, though.

-Kenny

On 2/26/25 00:44, Mika Westerberg wrote:
> Hi Kenneth,
> 
> On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:
>>
>> This is excellent news that you were able to reproduce it- I'd figured this
>> regression would have been caught already (as I do remember this working
>> before) and was worried it may have been specific to a particular piece of
>> hardware (or software setup) on my system.
>>
>> I'll see what I can dig up on my end, but as I'm not expert in these
>> subsystems I may not be able to diagnose anything until your return.
> 
> [Back now]
> 
> My git bisect ended up to this commit:
> 
>    9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")
> 
> Adding Lukas who is the expert.
> 
> My steps to reproduce on Intel Meteor Lake based reference system are:
> 
> 1. Boot the system up, nothing connected.
> 2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:
> 
>    [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
> 
> 3. Authorize PCIe tunnels (whatever your distro provides, my buildroot just
>      has the debugging tools so running 'tbauth -r 301')
> 
> 4. Check that the PCIe topology matches the expected (lspci)
> 
> 5. Enter s2idle:
> 
>    # rtcwake -s 30 -mmem
> 
> 6. Once it is suspended, unplug the cable between the host and the dock.
> 
> 7. Wait for the resume to happen.
> 
> Expectation: The system wakes up fine, notices that the TB and PCIe devices
> are gone, stays responsive and usable.
> 
> Actual result: Resume never completes.
> 
> I added "no_console_suspend" to the command line and then did sysrq-w to
> get a list of blocked tasks. I've attached it just in case it is needed.
> 
> If I revert the above commit the issue is gone. Now I'm not sure if this is
> exactly the same issue that you are seeing but nevertheless this is kind of
> normal use case so definitely something we should get fixed.
> 
> Lukas, if you need any more information let me know. I can reproduce this
> easily.
> 
>> I also saw some DRM/connected fixes posted to Linus' master so maybe one of
>> them corrects this new display-crash issue (I'm not home on my big monitor
>> to be able to test yet).
>>
>> -Kenny
>>
>> On 2/14/25 08:29, Mika Westerberg wrote:
>>> Hi,
>>>
>>> On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
>>>>
>>>> On 2/13/25 05:59, Mika Westerberg wrote:
>>>>
>>>>> Hi,
>>>>
>>>> As Murphy's would have it, now my crashes are display-driver related (this
>>>> is Xe, but I've also seen it with i915).
>>>>
>>>> Attached here just for the heck of it, but I'll be better testing the NVMe
>>>> enclosure-related failures this weekend. Stay tuned!
>>>
>>> Okay, I checked quickly and no TB related crash there but I was actually
>>> able to reproduce hang when I unplug the device chain during suspend. I did
>>> not yet have time to look into it deeper. I'm sure this has been working
>>> fine in the past as we tested all kinds of topologies including similar to
>>> this.
>>>
>>> I will be out next week for vacation but will continue after that if the
>>> problem is not already solved ;-)
>>>
>>
>> -- 
>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County
>> CA

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26 21:14                   ` Kenneth Crudup
@ 2025-02-27 17:46                     ` Kenneth Crudup
  2025-02-28 10:49                       ` Mika Westerberg
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-27 17:46 UTC (permalink / raw)
  To: Mika Westerberg, Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb

[-- Attachment #1: Type: text/plain, Size: 4854 bytes --]


So I think the failure mode may be related in some part to DP tunneling,
too. I finally got another lockup (this time after a hibernate, which I
guess exercises some of the same code). What was different this time,
compared to the runs where I couldn't reproduce the lockups, was that I
had an external USB-C monitor connected when I resumed. That matches what
happens with my CalDigit dock: when I'm home (where I sometimes forget to
remove the NVMe USB4 adaptor) I always have my monitor connected to the
dock.

See attached dump log. I'm using the (somewhat still experimental) Xe 
display driver, but I've seen this same lockup happen with i915.

In any case, I've now reverted 9d573d19, and when I get back to my 
CalDigit I can try instrumenting the code paths in the commit and see 
exactly where we're locking up.

-K

On 2/26/25 13:14, Kenneth Crudup wrote:
> OK, just did a resume after suspended (for an hour, which somehow seems 
> to matter) while my CalDigit dock was attached with the ASMedia NVMe 
> adaptor at suspend, but both disconnected on resume, and I am indeed 
> locked up.
> 
> I can attach the "pstore" report if necessary.
> 
> Unfortunately I won't be able to get back to the CalDigit until Saturday 
> afternoon California time.
> 
> I'll be trying all the reverts/commits listed herein and at least check 
> for regressions in other cases, though.
> 
> -Kenny
> 
> On 2/26/25 00:44, Mika Westerberg wrote:
>> Hi Kenneth,
>>
>> On Fri, Feb 14, 2025 at 09:39:33AM -0800, Kenneth Crudup wrote:
>>>
>>> This is excellent news that you were able to reproduce it- I'd 
>>> figured this
>>> regression would have been caught already (as I do remember this working
>>> before) and was worried it may have been specific to a particular 
>>> piece of
>>> hardware (or software setup) on my system.
>>>
>>> I'll see what I can dig up on my end, but as I'm not expert in these
>>> subsystems I may not be able to diagnose anything until your return.
>>
>> [Back now]
>>
>> My git bisect ended up to this commit:
>>
>>    9d573d19547b ("PCI: pciehp: Detect device replacement during system 
>> sleep")
>>
>> Adding Lukas who is the expert.
>>
>> My steps to reproduce on Intel Meteor Lake based reference system are:
>>
>> 1. Boot the system up, nothing connected.
>> 2. Once up, connect Thunderbolt 4 dock and Thunderbolt 3 NVMe in a chain:
>>
>>    [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
>>
>> 3. Authorize PCIe tunnels (whatever your distro provides, my buildroot 
>> just
>>      has the debugging tools so running 'tbauth -r 301')
>>
>> 4. Check that the PCIe topology matches the expected (lspci)
>>
>> 5. Enter s2idle:
>>
>>    # rtcwake -s 30 -mmem
>>
>> 6. Once it is suspended, unplug the cable between the host and the dock.
>>
>> 7. Wait for the resume to happen.
>>
>> Expectation: The system wakes up fine, notices that the TB and PCIe 
>> devices
>> are gone, stays responsive and usable.
>>
>> Actual result: Resume never completes.
>>
>> I added "no_console_suspend" to the command line and then did sysrq-w to
>> get a list of blocked tasks. I've attached it just in case it is needed.
>>
>> If I revert the above commit the issue is gone. Now I'm not sure if 
>> this is
>> exactly the same issue that you are seeing but nevertheless this is 
>> kind of
>> normal use case so definitely something we should get fixed.
>>
>> Lukas, if you need any more information let me know. I can reproduce this
>> easily.
>>
>>> I also saw some DRM/connected fixes posted to Linus' master so maybe 
>>> one of
>>> them corrects this new display-crash issue (I'm not home on my big 
>>> monitor
>>> to be able to test yet).
>>>
>>> -Kenny
>>>
>>> On 2/14/25 08:29, Mika Westerberg wrote:
>>>> Hi,
>>>>
>>>> On Thu, Feb 13, 2025 at 11:19:35AM -0800, Kenneth Crudup wrote:
>>>>>
>>>>> On 2/13/25 05:59, Mika Westerberg wrote:
>>>>>
>>>>>> Hi,
>>>>>
>>>>> As Murphy's would have it, now my crashes are display-driver 
>>>>> related (this
>>>>> is Xe, but I've also seen it with i915).
>>>>>
>>>>> Attached here just for the heck of it, but I'll be better testing 
>>>>> the NVMe
>>>>> enclosure-related failures this weekend. Stay tuned!
>>>>
>>>> Okay, I checked quickly and no TB related crash there but I was 
>>>> actually
>>>> able to reproduce hang when I unplug the device chain during 
>>>> suspend. I did
>>>> not yet have time to look into it deeper. I'm sure this has been 
>>>> working
>>>> fine in the past as we tested all kinds of topologies including 
>>>> similar to
>>>> this.
>>>>
>>>> I will be out next week for vacation but will continue after that if 
>>>> the
>>>> problem is not already solved ;-)
>>>>
>>>
>>> -- 
>>> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
>>> County
>>> CA
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA

[-- Attachment #2: pstore-202502262249.tar.bz2 --]
[-- Type: application/x-bzip, Size: 8376 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-27 17:46                     ` Kenneth Crudup
@ 2025-02-28 10:49                       ` Mika Westerberg
  2025-02-28 16:04                         ` Kenneth Crudup
  0 siblings, 1 reply; 28+ messages in thread
From: Mika Westerberg @ 2025-02-28 10:49 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb

Hi,

On Thu, Feb 27, 2025 at 09:46:07AM -0800, Kenneth Crudup wrote:
> So I think, the failure mode may be related in some part to DP/Tunneling,
> too- I finally got another lockup (this time, after a hibernate, which I
> guess is some of the same facility) but what was different about this time
> where I couldn't reproduce the lockups (and what happens when I use my
> CalDigit dock) was I had an external USB-C monitor connected when I resumed,
> and when I'm home (where I sometimes forget to remove the NVMe USB4 adaptor)
> I always have my monitor connected to the dock.

It would be good to stick with a "proven" use-case so that the steps are
always the same. This may involve several issues in various parts of the
kernel and we need to track them one by one. If you change the steps in the
middle then we may end up finding completely different issues and it is not
helping the debugging effort.

The steps at the moment would be simply this:

1. Boot the system up, nothing connected.
2. Connect Thunderbolt dock and make sure UI authorizes it.
3. Connect Thunderbolt NVMe to the Thunderbolt dock and make sure UI authorizes it.
4. Verify that the devices behind PCIe tunnels are visible and functional (lspci for example)
5. Suspend the laptop by closing lid.
6. Unplug the dock (and the NVMe).
7. Resume the laptop by opening the lid.

Expectation: The system resumes just fine, finds the devices gone and stays functional.
Actual result: The system does not resume properly; it seems to crash and burn, and the screen
	       is black.

Please correct me if I got something wrong. This is essentially the case
where you go from work to home: you unplug the dock and then resume the
laptop at home.

The other thing is that in the pstore I see these:

thunderbolt 0000:00:0d.2: 0:5: __tb_path_deactivate_hop(): 401

but there is no such log message in the mainline. If you have made local
changes, I suggest dropping all of them to make sure we are looking at the
same source code.

> See attached dump log. I'm using the (somewhat still experimental) Xe
> display driver, but I've seen this same lockup happen with i915.

Please also keep using the same graphics driver.

> In any case, I've now reverted 9d573d19, and when I get back to my CalDigit
> I can try instrumenting the code paths in the commit and see exactly where
> we're locking up.

No need to add any changes. Just try with the revert and see if that at
least makes the system resume properly. If it does then there could be
other issues but then you can take full dmesg and send to us instead of
those pstore snippets.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-28 10:49                       ` Mika Westerberg
@ 2025-02-28 16:04                         ` Kenneth Crudup
  2025-03-02 16:13                           ` Kenneth Crudup
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-02-28 16:04 UTC (permalink / raw)
  To: Mika Westerberg, Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb


I'm still several hundred miles from the consistently-reproducible 
hardware for another couple of days yet, so I've been logging the other 
failures as they happen.

Don't worry about the printk()s WRT to the code; a couple of weeks ago 
I'd seen an NPE on resume in __tb_path_deactivate_hop so threw in a 
bunch of tb_port_info(port, "%s(): %d\n", __func__, __LINE__); so I 
could get an idea of where the crash was.
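
A rough userspace illustration of that tracing idea follows. It is only a
sketch: TRACE_HERE and hop_sketch() are made-up names, and snprintf() stands
in for the tb_port_info() call quoted above; the actual local change is not
shown in this thread.

```c
#include <stdio.h>

/*
 * Userspace stand-in for the tb_port_info() call quoted above: it formats
 * "<function>(): <line>" so the last marker seen in a log shows how far
 * execution got. TRACE_HERE and hop_sketch() are hypothetical names.
 */
#define TRACE_HERE(buf, len) \
	snprintf((buf), (len), "%s(): %d", __func__, __LINE__)

/* Hypothetical function standing in for the code path being traced. */
static int hop_sketch(char *buf, size_t len)
{
	/* Work that might crash would sit before and after this marker. */
	return TRACE_HERE(buf, len);
}
```

Calling hop_sketch() with a 64-byte buffer fills it with the function name
followed by the line number of the marker, which is the same shape as the
pstore lines quoted earlier in the thread.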

I'll have more info Sunday with the original kernel, one with the 
revert, and with some of Lukas' proposed(? not sure if they made it in 
there) changes from his previous E-mail.

-K

On 2/28/25 02:49, Mika Westerberg wrote:
> Hi,
> 
> On Thu, Feb 27, 2025 at 09:46:07AM -0800, Kenneth Crudup wrote:
>> So I think, the failure mode may be related in some part to DP/Tunneling,
>> too- I finally got another lockup (this time, after a hibernate, which I
>> guess is some of the same facility) but what was different about this time
>> where I couldn't reproduce the lockups (and what happens when I use my
>> CalDigit dock) was I had an external USB-C monitor connected when I resumed,
>> and when I'm home (where I sometimes forget to remove the NVMe USB4 adaptor)
>> I always have my monitor connected to the dock.
> 
> It would be good to stick with a "proven" use-case so that the steps are
> always the same. This may involve several issues in various parts of the
> kernel and we need to track them one by one. If you change the steps in the
> middle then we may end up finding completely different issues and it is not
> helping the debugging effort.
> 
> The steps at the moment would be simply this:
> 
> 1. Boot the system up, nothing connected.
> 2. Connect Thunderbolt dock and make sure UI authorizes it.
> 3. Connect Thunderbolt NVMe to the Thunderbolt dock and make sure UI authorizes it.
> 4. Verify that the devices behind PCIe tunnels are visible and functional (lspci for example)
> 5. Suspend the laptop by closing lid.
> 6. Unplug the dock (and the NVMe).
> 7. Resume the laptop by opening the lid.
> 
> Expectation: The system resumes just fine, finds the devices gone and stays functional.
> Actual result: The system does not resume properly and seems to crash and burn; the
> 	       screen is black.
> 
> Please correct me if I got something wrong. Essentially, you go from work
> to home: you unplug the dock and then resume the laptop at home.
> 
> The other thing is that in the pstore I see these:
> 
> thunderbolt 0000:00:0d.2: 0:5: __tb_path_deactivate_hop(): 401
> 
> but there is no such log in the mainline. If you have made some local
> changes, I suggest dropping all of them to make sure we are looking at the
> same source code.
> 
>> See attached dump log. I'm using the (somewhat still experimental) Xe
>> display driver, but I've seen this same lockup happen with i915.
> 
> Please also keep using the same graphics driver.
> 
>> In any case, I've now reverted 9d573d19, and when I get back to my CalDigit
>> I can try instrumenting the code paths in the commit and see exactly where
>> we're locking up.
> 
> No need to add any changes. Just try with the revert and see if that at
> least makes the system resume properly. If it does then there could be
> other issues but then you can take full dmesg and send to us instead of
> those pstore snippets.
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA



* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-28 16:04                         ` Kenneth Crudup
@ 2025-03-02 16:13                           ` Kenneth Crudup
  2025-03-03 10:48                             ` Mika Westerberg
  0 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-03-02 16:13 UTC (permalink / raw)
  To: Mika Westerberg, Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb


On 2/28/25 08:04, Kenneth Crudup wrote:

> Don't worry about the printk()s WRT the code; a couple of weeks ago
> I'd seen an NPE on resume in __tb_path_deactivate_hop so threw in a 
> bunch of tb_port_info(port, "%s(): %d\n", __func__, __LINE__); so I 
> could get an idea of where the crash was.

I've started a separate E-mail about this, but I'd determined those 
crashes were due to d6d458d42e1 ("Handle DisplayPort tunnel activation 
asynchronously").

Since reverting 9d573d1954 and d6d458d42e1 I've been testing several
resume scenarios (NVMe connected/disconnected and/or external
DP-tunneled monitor connected/disconnected) and have yet to have a resume
or hibernate failure over several cycles.

Now, how do I help you guys go about fixing these commits?

-K

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA



* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-02 16:13                           ` Kenneth Crudup
@ 2025-03-03 10:48                             ` Mika Westerberg
  0 siblings, 0 replies; 28+ messages in thread
From: Mika Westerberg @ 2025-03-03 10:48 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Lukas Wunner, Yehezkel Bernat,
	linux-usb

On Sun, Mar 02, 2025 at 08:13:51AM -0800, Kenneth Crudup wrote:
> 
> On 2/28/25 08:04, Kenneth Crudup wrote:
> 
> > Don't worry about the printk()s WRT the code; a couple of weeks ago
> > I'd seen an NPE on resume in __tb_path_deactivate_hop so threw in a
> > bunch of tb_port_info(port, "%s(): %d\n", __func__, __LINE__); so I
> > could get an idea of where the crash was.
> 
> I've started a separate E-mail about this, but I'd determined those crashes
> were due to d6d458d42e1 ("Handle DisplayPort tunnel activation
> asynchronously").
> 
> Since reverting 9d573d1954 and d6d458d42e1 I've been testing several resume
> scenarios (NVMe connected/disconnected and/or external DP-tunneled monitor
> connected/disconnected) and have yet to have a resume or hibernate failure
> over several cycles.
> 
> Now, how do I help you guys go about fixing these commits?

I commented on the other thread. Let's deal with these as two separate
issues and investigate both in isolation.

For others, the second thread is this one:

https://lore.kernel.org/linux-usb/8e175721-806f-45d6-892a-bd3356af80c9@panix.com/


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-02-26  9:19                     ` Mika Westerberg
@ 2025-03-03 20:00                       ` Lukas Wunner
  2025-03-03 20:57                         ` Kenneth Crudup
  2025-03-04  8:23                         ` Mika Westerberg
  0 siblings, 2 replies; 28+ messages in thread
From: Lukas Wunner @ 2025-03-03 20:00 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Kenneth Crudup, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Wed, Feb 26, 2025 at 11:19:58AM +0200, Mika Westerberg wrote:
> On Wed, Feb 26, 2025 at 10:10:43AM +0100, Lukas Wunner wrote:
> > On Wed, Feb 26, 2025 at 10:44:04AM +0200, Mika Westerberg wrote:
> > >   [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
> > [...]
> > > I added "no_console_suspend" to the command line and then did sysrq-w to
> > > get list of blocked tasks. I've attached it just in case it is needed.
> > 
> > This looks like the deadlock we've had for years when hot-removing
> > nested hotplug ports.
> > 
> > If you attach only a single device to the host, I guess the issue
> > does not occur, right?
> 
> Yes.
> 
> > Previous attempts to fix this:
> > 
> > https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
> > 
> > https://lore.kernel.org/all/20240612181625.3604512-1-kbusch@meta.com/
> 
> Well, it does not happen if I revert the commit so isn't that a
> regression?

Does the below fix the issue?

-- >8 --

diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index ff458e6..b0b4d46 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -287,24 +287,26 @@ static int pciehp_suspend(struct pcie_device *dev)
 static bool pciehp_device_replaced(struct controller *ctrl)
 {
 	struct pci_dev *pdev __free(pci_dev_put);
+	u64 dsn;
 	u32 reg;
 
 	pdev = pci_get_slot(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
 	if (!pdev)
-		return true;
+		return false;
 
-	if (pci_read_config_dword(pdev, PCI_VENDOR_ID, &reg) ||
-	    reg != (pdev->vendor | (pdev->device << 16)) ||
-	    pci_read_config_dword(pdev, PCI_CLASS_REVISION, &reg) ||
-	    reg != (pdev->revision | (pdev->class << 8)))
+	if ((pci_read_config_dword(pdev, PCI_VENDOR_ID, &reg) == 0 &&
+	     reg != (pdev->vendor | (pdev->device << 16))) ||
+	    (pci_read_config_dword(pdev, PCI_CLASS_REVISION, &reg) == 0 &&
+	     reg != (pdev->revision | (pdev->class << 8))))
 		return true;
 
 	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL &&
-	    (pci_read_config_dword(pdev, PCI_SUBSYSTEM_VENDOR_ID, &reg) ||
-	     reg != (pdev->subsystem_vendor | (pdev->subsystem_device << 16))))
+	    pci_read_config_dword(pdev, PCI_SUBSYSTEM_VENDOR_ID, &reg) == 0 &&
+	    reg != (pdev->subsystem_vendor | (pdev->subsystem_device << 16)))
 		return true;
 
-	if (pci_get_dsn(pdev) != ctrl->dsn)
+	dsn = pci_get_dsn(pdev);
+	if ((dsn || ctrl->dsn) && dsn != ctrl->dsn)
 		return true;
 
 	return false;
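
For readers skimming the diff, the central change to the config-space checks
can be modelled in a few lines of standalone C. This is a sketch under
stated assumptions, not kernel code: struct read_result, replaced_old() and
replaced_new() are hypothetical names, and "err" plays the role of the
nonzero return value of pci_read_config_dword().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one config-space read: err is 0 on success. */
struct read_result {
	int err;
	uint32_t val;
};

/* Shape of the check before the patch: a failed read OR a mismatch. */
static bool replaced_old(struct read_result r, uint32_t cached)
{
	return r.err || r.val != cached;
}

/* Shape of the check in the patch: only a successful read that mismatches. */
static bool replaced_new(struct read_result r, uint32_t cached)
{
	return r.err == 0 && r.val != cached;
}
```

The two predicates only disagree when the read itself fails: the old form
reports a replacement in that case, the patched form does not.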


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-03 20:00                       ` Lukas Wunner
@ 2025-03-03 20:57                         ` Kenneth Crudup
  2025-03-04  8:23                         ` Mika Westerberg
  1 sibling, 0 replies; 28+ messages in thread
From: Kenneth Crudup @ 2025-03-03 20:57 UTC (permalink / raw)
  To: Lukas Wunner, Mika Westerberg, Me
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs, Andreas Noever,
	Michael Jamet, Yehezkel Bernat, linux-usb

On 3/3/25 12:00, Lukas Wunner wrote:

> Does the below fix the issue?

So far, so good! But, part of why it was so hard for me to bisect to the 
Subject: commit was 'cause it didn't always OOPS; but I'll continue to 
test on the most-likely failure mode (CalDigit TS4 to TB NVMe adaptor 
connected at suspend, then nothing (or USB-C dock) on resume).

But if this does indeed fix it, this will make TWO crash bugs on resume 
squashed in less than 12 hours- gotta love Open Source Software!

-Kenny

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-03 20:00                       ` Lukas Wunner
  2025-03-03 20:57                         ` Kenneth Crudup
@ 2025-03-04  8:23                         ` Mika Westerberg
  2025-03-06 16:45                           ` Lukas Wunner
  1 sibling, 1 reply; 28+ messages in thread
From: Mika Westerberg @ 2025-03-04  8:23 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kenneth Crudup, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Mon, Mar 03, 2025 at 09:00:28PM +0100, Lukas Wunner wrote:
> On Wed, Feb 26, 2025 at 11:19:58AM +0200, Mika Westerberg wrote:
> > On Wed, Feb 26, 2025 at 10:10:43AM +0100, Lukas Wunner wrote:
> > > On Wed, Feb 26, 2025 at 10:44:04AM +0200, Mika Westerberg wrote:
> > > >   [Meteor Lake host] <--> [TB 4 dock] <--> [TB 3 NVMe]
> > > [...]
> > > > I added "no_console_suspend" to the command line and then did sysrq-w to
> > > > get list of blocked tasks. I've attached it just in case it is needed.
> > > 
> > > This looks like the deadlock we've had for years when hot-removing
> > > nested hotplug ports.
> > > 
> > > If you attach only a single device to the host, I guess the issue
> > > does not occur, right?
> > 
> > Yes.
> > 
> > > Previous attempts to fix this:
> > > 
> > > https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/
> > > 
> > > https://lore.kernel.org/all/20240612181625.3604512-1-kbusch@meta.com/
> > 
> > Well, it does not happen if I revert the commit so isn't that a
> > regression?
> 
> Does the below fix the issue?

Unfortunately I still see the same hang. I double checked: with the revert
the problem goes away, and with this patch I still see it.

Steps:

1. Boot the system, nothing connected.
2. Connect TBT 4 dock to the host.
3. Connect TBT 3 NVMe to the TBT 4 dock.
4. Authorize both PCIe tunnels, verify devices are there.
5. Enter s2idle.
6. Unplug the TBT 4 dock from the host.
7. Exit s2idle.


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-04  8:23                         ` Mika Westerberg
@ 2025-03-06 16:45                           ` Lukas Wunner
  2025-03-06 16:56                             ` Kenneth Crudup
                                               ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Lukas Wunner @ 2025-03-06 16:45 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Kenneth Crudup, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Tue, Mar 04, 2025 at 10:23:14AM +0200, Mika Westerberg wrote:
> Unfortunately I still see the same hang. I double checked: with the revert
> the problem goes away, and with this patch I still see it.
> 
> Steps:
> 
> 1. Boot the system, nothing connected.
> 2. Connect TBT 4 dock to the host.
> 3. Connect TBT 3 NVMe to the TBT 4 dock.
> 4. Authorize both PCIe tunnels, verify devices are there.
> 5. Enter s2idle.
> 6. Unplug the TBT 4 dock from the host.
> 7. Exit s2idle.

Thanks for testing.  Would you mind giving the below a spin?

I've realized this can likely be solved in a much easier way:

The ->resume_noirq callback is invoked while traversing down
the hierarchy and the topmost slot which detects device replacement
already marks everything below as disconnected.  Hence any nested
hotplug ports can just skip the replacement check because they're
disconnected as well.
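
The reasoning above can be modelled with a small standalone sketch. It is
only an illustration under simplifying assumptions: struct port_sketch and
the three helper functions are hypothetical, each port has at most one
nested port, and a plain loop stands in for the top-down order in which
->resume_noirq is invoked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified hotplug port; not the kernel's struct pci_dev. */
struct port_sketch {
	bool disconnected;         /* set once a port above saw a replacement */
	bool below_changed;        /* what a config-space comparison would find */
	struct port_sketch *child; /* single nested hotplug port, or NULL */
};

/* Mark every port below the given one as disconnected. */
static void mark_below_disconnected(struct port_sketch *port)
{
	for (struct port_sketch *p = port->child; p; p = p->child)
		p->disconnected = true;
}

/* Mirrors the early return: a disconnected port skips the check. */
static bool device_replaced_sketch(const struct port_sketch *port)
{
	if (port->disconnected)
		return false;
	return port->below_changed;
}

/* Walk from the top down; return how many ports reported a replacement. */
static int resume_walk_sketch(struct port_sketch *root)
{
	int reported = 0;

	for (struct port_sketch *p = root; p; p = p->child) {
		if (device_replaced_sketch(p)) {
			mark_below_disconnected(p);
			reported++;
		}
	}
	return reported;
}
```

With a three-level chain in which every level would see a change, only the
topmost port reports a replacement in this model; the nested ones are already
marked disconnected by the time they are visited and return early.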

-- >8 --

diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index ff458e6..997841c 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -286,9 +286,12 @@ static int pciehp_suspend(struct pcie_device *dev)
 
 static bool pciehp_device_replaced(struct controller *ctrl)
 {
-	struct pci_dev *pdev __free(pci_dev_put);
+	struct pci_dev *pdev __free(pci_dev_put) = NULL;
 	u32 reg;
 
+	if (pci_dev_is_disconnected(ctrl->pcie->port))
+		return false;
+
 	pdev = pci_get_slot(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
 	if (!pdev)
 		return true;


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-06 16:45                           ` Lukas Wunner
@ 2025-03-06 16:56                             ` Kenneth Crudup
  2025-03-06 18:18                               ` Lukas Wunner
  2025-03-06 20:38                             ` Kenneth Crudup
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Kenneth Crudup @ 2025-03-06 16:56 UTC (permalink / raw)
  To: Lukas Wunner, Mika Westerberg, Me
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs, Andreas Noever,
	Michael Jamet, Yehezkel Bernat, linux-usb



Is this a separate commit on top of master, or along with your previous fix?

-Kenny

On 3/6/25 08:45, Lukas Wunner wrote:
> On Tue, Mar 04, 2025 at 10:23:14AM +0200, Mika Westerberg wrote:
>> Unfortunately I still see the same hang. I double checked: with the revert
>> the problem goes away, and with this patch I still see it.
>>
>> Steps:
>>
>> 1. Boot the system, nothing connected.
>> 2. Connect TBT 4 dock to the host.
>> 3. Connect TBT 3 NVMe to the TBT 4 dock.
>> 4. Authorize both PCIe tunnels, verify devices are there.
>> 5. Enter s2idle.
>> 6. Unplug the TBT 4 dock from the host.
>> 7. Exit s2idle.
> 
> Thanks for testing.  Would you mind giving the below a spin?
> 
> I've realized this can likely be solved in a much easier way:
> 
> The ->resume_noirq callback is invoked while traversing down
> the hierarchy and the topmost slot which detects device replacement
> already marks everything below as disconnected.  Hence any nested
> hotplug ports can just skip the replacement check because they're
> disconnected as well.
> 
> -- >8 --
> 
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ff458e6..997841c 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -286,9 +286,12 @@ static int pciehp_suspend(struct pcie_device *dev)
>   
>   static bool pciehp_device_replaced(struct controller *ctrl)
>   {
> -	struct pci_dev *pdev __free(pci_dev_put);
> +	struct pci_dev *pdev __free(pci_dev_put) = NULL;
>   	u32 reg;
>   
> +	if (pci_dev_is_disconnected(ctrl->pcie->port))
> +		return false;
> +
>   	pdev = pci_get_slot(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
>   	if (!pdev)
>   		return true;
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA



* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-06 16:56                             ` Kenneth Crudup
@ 2025-03-06 18:18                               ` Lukas Wunner
  0 siblings, 0 replies; 28+ messages in thread
From: Lukas Wunner @ 2025-03-06 18:18 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Mika Westerberg, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Thu, Mar 06, 2025 at 08:56:44AM -0800, Kenneth Crudup wrote:
> On 3/6/25 08:45, Lukas Wunner wrote:
> > Thanks for testing.  Would you mind giving the below a spin?
> 
> Is this a separate commit on top of master, or along with your previous fix?

It's a separate commit on top of master.

Thanks,

Lukas


* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-06 16:45                           ` Lukas Wunner
  2025-03-06 16:56                             ` Kenneth Crudup
@ 2025-03-06 20:38                             ` Kenneth Crudup
  2025-03-07  2:04                             ` Kenneth Crudup
  2025-03-07 10:34                             ` Mika Westerberg
  3 siblings, 0 replies; 28+ messages in thread
From: Kenneth Crudup @ 2025-03-06 20:38 UTC (permalink / raw)
  To: Lukas Wunner, Mika Westerberg, Me
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs, Andreas Noever,
	Michael Jamet, Yehezkel Bernat, linux-usb


I'll do more testing but it's been a couple of attempts and it hasn't 
locked up on me.

Curious to see how Mika fares with it.

-Kenny


On 3/6/25 08:45, Lukas Wunner wrote:
> On Tue, Mar 04, 2025 at 10:23:14AM +0200, Mika Westerberg wrote:
>> Unfortunately I still see the same hang. I double checked: with the revert
>> the problem goes away, and with this patch I still see it.
>>
>> Steps:
>>
>> 1. Boot the system, nothing connected.
>> 2. Connect TBT 4 dock to the host.
>> 3. Connect TBT 3 NVMe to the TBT 4 dock.
>> 4. Authorize both PCIe tunnels, verify devices are there.
>> 5. Enter s2idle.
>> 6. Unplug the TBT 4 dock from the host.
>> 7. Exit s2idle.
> 
> Thanks for testing.  Would you mind giving the below a spin?
> 
> I've realized this can likely be solved in a much easier way:
> 
> The ->resume_noirq callback is invoked while traversing down
> the hierarchy and the topmost slot which detects device replacement
> already marks everything below as disconnected.  Hence any nested
> hotplug ports can just skip the replacement check because they're
> disconnected as well.
> 
> -- >8 --
> 
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ff458e6..997841c 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -286,9 +286,12 @@ static int pciehp_suspend(struct pcie_device *dev)
>   
>   static bool pciehp_device_replaced(struct controller *ctrl)
>   {
> -	struct pci_dev *pdev __free(pci_dev_put);
> +	struct pci_dev *pdev __free(pci_dev_put) = NULL;
>   	u32 reg;
>   
> +	if (pci_dev_is_disconnected(ctrl->pcie->port))
> +		return false;
> +
>   	pdev = pci_get_slot(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
>   	if (!pdev)
>   		return true;
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA



* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-06 16:45                           ` Lukas Wunner
  2025-03-06 16:56                             ` Kenneth Crudup
  2025-03-06 20:38                             ` Kenneth Crudup
@ 2025-03-07  2:04                             ` Kenneth Crudup
  2025-03-07 10:34                             ` Mika Westerberg
  3 siblings, 0 replies; 28+ messages in thread
From: Kenneth Crudup @ 2025-03-07  2:04 UTC (permalink / raw)
  To: Lukas Wunner, Mika Westerberg, Me
  Cc: Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas, Jian-Hong Pan,
	linux-pci, linux-kernel, Niklāvs Koļesņikovs, Andreas Noever,
	Michael Jamet, Yehezkel Bernat, linux-usb


On 3/6/25 08:45, Lukas Wunner wrote:

> Thanks for testing.  Would you mind giving the below a spin?
> I've realized this can likely be solved in a much easier way:

OK, I've tried all the scenarios that had failures with your original 
commit, and with your latest change I've not had any issues.

ACK :)

-Kenny

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA



* Re: diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac))
  2025-03-06 16:45                           ` Lukas Wunner
                                               ` (2 preceding siblings ...)
  2025-03-07  2:04                             ` Kenneth Crudup
@ 2025-03-07 10:34                             ` Mika Westerberg
  3 siblings, 0 replies; 28+ messages in thread
From: Mika Westerberg @ 2025-03-07 10:34 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kenneth Crudup, Bjorn Helgaas, ilpo.jarvinen, Bjorn Helgaas,
	Jian-Hong Pan, linux-pci, linux-kernel, Niklāvs Koļesņikovs,
	Andreas Noever, Michael Jamet, Yehezkel Bernat, linux-usb

On Thu, Mar 06, 2025 at 05:45:23PM +0100, Lukas Wunner wrote:
> On Tue, Mar 04, 2025 at 10:23:14AM +0200, Mika Westerberg wrote:
> > Unfortunately I still see the same hang. I double checked: with the revert
> > the problem goes away, and with this patch I still see it.
> > 
> > Steps:
> > 
> > 1. Boot the system, nothing connected.
> > 2. Connect TBT 4 dock to the host.
> > 3. Connect TBT 3 NVMe to the TBT 4 dock.
> > 4. Authorize both PCIe tunnels, verify devices are there.
> > 5. Enter s2idle.
> > 6. Unplug the TBT 4 dock from the host.
> > 7. Exit s2idle.
> 
> Thanks for testing.  Would you mind giving the below a spin?

Sure.

> I've realized this can likely be solved in a much easier way:
> 
> The ->resume_noirq callback is invoked while traversing down
> the hierarchy and the topmost slot which detects device replacement
> already marks everything below as disconnected.  Hence any nested
> hotplug ports can just skip the replacement check because they're
> disconnected as well.

Makes sense.

Tried the patch now and it solves the issue. Thanks!

Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>

> 
> -- >8 --
> 
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ff458e6..997841c 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -286,9 +286,12 @@ static int pciehp_suspend(struct pcie_device *dev)
>  
>  static bool pciehp_device_replaced(struct controller *ctrl)
>  {
> -	struct pci_dev *pdev __free(pci_dev_put);
> +	struct pci_dev *pdev __free(pci_dev_put) = NULL;
>  	u32 reg;
>  
> +	if (pci_dev_is_disconnected(ctrl->pcie->port))
> +		return false;
> +
>  	pdev = pci_get_slot(ctrl->pcie->port->subordinate, PCI_DEVFN(0, 0));
>  	if (!pdev)
>  		return true;


end of thread, other threads:[~2025-03-07 10:35 UTC | newest]

Thread overview: 28+ messages
     [not found] <b2abd254-d11f-4ef7-8664-b9e5a1409abc@panix.com>
2025-02-10 21:05 ` PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac) Bjorn Helgaas
2025-02-11  0:18   ` Kenneth Crudup
2025-02-11  5:57     ` Mika Westerberg
2025-02-11  6:17       ` diagnosing resume failures after disconnected USB4 drives (Was: Re: PCI/ASPM: Fix L1SS saving (linus/master commit 7507eb3e7bfac)) Kenneth Crudup
2025-02-13 13:59         ` Mika Westerberg
2025-02-13 19:19           ` Kenneth Crudup
2025-02-14 16:29             ` Mika Westerberg
2025-02-14 17:39               ` Kenneth Crudup
2025-02-26  8:44                 ` Mika Westerberg
2025-02-26  9:10                   ` Lukas Wunner
2025-02-26  9:19                     ` Mika Westerberg
2025-03-03 20:00                       ` Lukas Wunner
2025-03-03 20:57                         ` Kenneth Crudup
2025-03-04  8:23                         ` Mika Westerberg
2025-03-06 16:45                           ` Lukas Wunner
2025-03-06 16:56                             ` Kenneth Crudup
2025-03-06 18:18                               ` Lukas Wunner
2025-03-06 20:38                             ` Kenneth Crudup
2025-03-07  2:04                             ` Kenneth Crudup
2025-03-07 10:34                             ` Mika Westerberg
2025-02-26 15:31                   ` Kenneth Crudup
2025-02-26 21:13                   ` Kenneth Crudup
2025-02-26 21:14                   ` Kenneth Crudup
2025-02-27 17:46                     ` Kenneth Crudup
2025-02-28 10:49                       ` Mika Westerberg
2025-02-28 16:04                         ` Kenneth Crudup
2025-03-02 16:13                           ` Kenneth Crudup
2025-03-03 10:48                             ` Mika Westerberg
