[PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width

public inbox for linux-usb@vger.kernel.org

* [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
@ 2025-12-09  5:41 Chia-Lin Kao (AceLan)
  2025-12-09  7:06 ` Mika Westerberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chia-Lin Kao (AceLan) @ 2025-12-09  5:41 UTC (permalink / raw)
  To: Andreas Noever, Mika Westerberg, Yehezkel Bernat, linux-usb,
	linux-kernel

When plugging in a Dell U2725QE Thunderbolt monitor, the kernel produces
a call trace during initial enumeration. The device automatically
disconnects and reconnects ~3 seconds later, and works correctly on the
second attempt.

Issue Description:
==================
The Dell U2725QE (USB4 device 8087:b26) requires additional time during
link width negotiation from single lane to dual lane. On first plug, the
following sequence occurs:

1. Port state reaches TB_PORT_UP (link established, single lane)
2. Path activation begins immediately
3. tb_path_activate() -> tb_port_write() returns -ENOTCONN (error -107)
4. Call trace is generated at tb_path_activate()
5. Device disconnects/reconnects automatically after ~3 seconds
6. Second attempt succeeds with full dual-lane bandwidth

First attempt dmesg (failure):
-------------------------------
[   36.030347] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 9000/9000 Mb/s
[   36.030613] thunderbolt 0000:c7:00.6: 2: USB3 tunnel creation failed
[   36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
[   36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145

Second attempt dmesg (success):
--------------------------------
[   40.440012] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 36000/36000 Mb/s
[   40.440261] thunderbolt 0000:c7:00.6: 2:16: maximum required bandwidth for USB3 tunnel 9000 Mb/s
[   40.440269] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): activating
[   40.440271] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): allocating initial bandwidth 9000/9000 Mb/s

The bandwidth difference (9000 vs 36000 Mb/s) indicates the first attempt
occurs while the link is still in single-lane mode.
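As a rough sketch of how the two figures could arise: assuming the
reported value is per-lane speed times lane count with 10% held back as
a guard band (an assumption, not something the log states), 9000 Mb/s
matches a 10 Gb/s x1 link and 36000 Mb/s matches a 20 Gb/s x2 link. The
helper name below is hypothetical.

```c
/*
 * Hypothetical helper: usable bandwidth in Mb/s for a link, assuming a
 * 10% guard band is subtracted from the raw rate (an assumption made
 * for illustration, not taken from the log above).
 */
static unsigned int usable_bw_mbps(unsigned int gbps_per_lane,
				   unsigned int lanes)
{
	unsigned int raw = gbps_per_lane * lanes * 1000;

	return raw - raw / 10;
}
```

If that assumption holds, the 4x gap would mean the first attempt differs
in per-lane speed as well as lane count, since doubling the lanes alone
only accounts for a factor of two.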

Root Cause Analysis:
====================
The error originates from the Thunderbolt/USB4 device hardware itself:

1. Port config space read/write returns TB_CFG_ERROR_PORT_NOT_CONNECTED
2. This gets translated to -ENOTCONN in tb_cfg_get_error()
3. The port's control channel is temporarily unavailable during state
   transition from single lane to dual lane (lane bonding)

The comment in drivers/thunderbolt/ctl.c explains this is expected:
  "Port is not connected. This can happen during surprise removal.
   Do not warn."
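
The translation in steps 1 and 2 can be pictured with a minimal
user-space sketch. The enum and function names are simplified stand-ins,
not the driver's definitions, and the -EIO fallback is an assumption
made for illustration.

```c
#include <errno.h>

/*
 * Simplified stand-ins for control-channel error codes. The names and
 * values are illustrative only and do not match the driver's enum.
 */
enum cfg_error {
	CFG_ERROR_PORT_NOT_CONNECTED,
	CFG_ERROR_OTHER,
};

/* Map a control-channel error to a negative errno value. */
static int cfg_error_to_errno(enum cfg_error err)
{
	switch (err) {
	case CFG_ERROR_PORT_NOT_CONNECTED:
		return -ENOTCONN;	/* reported as error -107 above */
	default:
		return -EIO;		/* assumed generic fallback */
	}
}
```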

Attempted Solutions:
====================
1. Retry logic on -ENOTCONN in tb_path_activate():
   Result: Caused host port (0:0) lockup with hundreds of "downstream
   port is locked" errors. Rejected by user.

2. Increased tb_port_wait_for_link_width() timeout from 100ms to 3000ms:
   Result: Did not resolve the issue. The timeout increase alone is
   insufficient because the port state hasn't reached TB_PORT_UP when
   lane bonding is attempted.

3. Added msleep(2000) at various points in enumeration flow:
   Locations tested:
   - Before tb_switch_configure(): Works ✓
   - Before tb_switch_add(): Works ✓
   - Before usb4_port_hotplug_enable(): Works ✓
   - After tb_switch_add(): Doesn't work ✗
   - In tb_configure_link(): Doesn't work ✗
   - In tb_switch_lane_bonding_enable(): Doesn't work ✗
   - In tb_port_wait_for_link_width(): Doesn't work ✗

   The pattern shows the delay must occur BEFORE hotplug enable, which
   happens early in tb_switch_port_hotplug_enable() -> usb4_port_hotplug_enable().

Current Workaround:
===================
Add a 2-second delay in tb_wait_for_port() when the port state reaches
TB_PORT_UP. This is the earliest point where we know:
- The link is physically established
- The device is responsive
- But lane width negotiation may still be in progress

This location is chosen because:
1. It's called during port enumeration before any tunnel creation
2. The port has just transitioned to TB_PORT_UP state
3. Allows sufficient time for lane bonding to complete
4. Avoids affecting other code paths

Testing Results:
================
With this patch:
- No call trace on first plug
- Device enumerates correctly on first attempt
- Full bandwidth (36000 Mb/s) available immediately
- No disconnect/reconnect cycle
- USB and PCIe tunnels create successfully

Without this patch:
- Call trace on every first plug
- Only 9000 Mb/s bandwidth (single lane) on first attempt
- Automatic disconnect/reconnect after ~3 seconds
- Second attempt works with 36000 Mb/s

Discussion Points for RFC:
===========================
1. Is a fixed 2-second delay acceptable, or should we poll for a
   specific hardware state?

2. Should we check PORT_CS_18_TIP (Transition In Progress) bit instead
   of using a fixed delay?

3. Is there a better location for this delay in the enumeration flow?

4. Should this be device-specific (based on vendor/device ID) or apply
   to all USB4 devices?

5. The 100ms timeout in tb_switch_lane_bonding_enable() may be too
   short for other devices as well. Should we increase it universally?
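
For points 1 and 2, the alternative to a fixed delay is a bounded poll
on a status bit. The following is a user-space sketch of that pattern
only: the reader callback, the fake register, and the bit position are
all hypothetical, and nothing here shows that polling a particular bit
would resolve this issue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register reader: returns the current register value. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll until every bit in 'mask' reads back as zero, or give up after
 * 'max_polls' reads. Returns true if the bits cleared in time.
 * A real driver would sleep between reads and bound the wait by time.
 */
static bool poll_until_clear(read_reg_fn read_reg, void *ctx,
			     uint32_t mask, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if ((read_reg(ctx) & mask) == 0)
			return true;
	}
	return false;
}

/* Fake register for demonstration: the bit stays set for 'busy_reads' reads. */
struct fake_reg {
	unsigned int busy_reads;
	uint32_t busy_bit;
};

static uint32_t fake_read(void *ctx)
{
	struct fake_reg *r = ctx;

	if (r->busy_reads > 0) {
		r->busy_reads--;
		return r->busy_bit;
	}
	return 0;
}
```

Compared with msleep(2000), a loop like this returns as soon as the
condition is met and still fails cleanly when it never is.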

Hardware Details:
=================
Device: Dell U2725QE Thunderbolt Monitor
USB4 Router: 8087:b26 (Intel USB4 controller)
Host: AMD Thunderbolt 4 controller (0000:c7:00.6)

Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
---
Full dmesg log available at: https://paste.ubuntu.com/p/CXs2T4XzZ3/
---
 drivers/thunderbolt/switch.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
index b3948aad0b955..e0c65e5fb0dca 100644
--- a/drivers/thunderbolt/switch.c
+++ b/drivers/thunderbolt/switch.c
@@ -530,6 +530,8 @@ int tb_wait_for_port(struct tb_port *port, bool wait_if_unplugged)
 			return 0;

 		case TB_PORT_UP:
+			msleep(2000);
+			fallthrough;
 		case TB_PORT_TX_CL0S:
 		case TB_PORT_RX_CL0S:
 		case TB_PORT_CL1:
-- 
2.43.0

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-09  5:41 [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width Chia-Lin Kao (AceLan)
@ 2025-12-09  7:06 ` Mika Westerberg
  2025-12-09 16:49   ` Mario Limonciello
  2025-12-10  3:15   ` Chia-Lin Kao (AceLan)
  0 siblings, 2 replies; 21+ messages in thread
From: Mika Westerberg @ 2025-12-09  7:06 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan)
  Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, linux-usb,
	linux-kernel, Mario Limonciello

+Mario since this is AMD related.

[Also keeping all the context].

On Tue, Dec 09, 2025 at 01:41:41PM +0800, Chia-Lin Kao (AceLan) wrote:
> When plugging in a Dell U2725QE Thunderbolt monitor, the kernel produces
> a call trace during initial enumeration. The device automatically
> disconnects and reconnects ~3 seconds later, and works correctly on the
> second attempt.
> 
> Issue Description:
> ==================
> The Dell U2725QE (USB4 device 8087:b26) requires additional time during
> link width negotiation from single lane to dual lane. On first plug, the
> following sequence occurs:
> 
> 1. Port state reaches TB_PORT_UP (link established, single lane)
> 2. Path activation begins immediately
> 3. tb_path_activate() -> tb_port_write() returns -ENOTCONN (error -107)
> 4. Call trace is generated at tb_path_activate()
> 5. Device disconnects/reconnects automatically after ~3 seconds
> 6. Second attempt succeeds with full dual-lane bandwidth
> 
> First attempt dmesg (failure):
> -------------------------------
> [   36.030347] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 9000/9000 Mb/s
> [   36.030613] thunderbolt 0000:c7:00.6: 2: USB3 tunnel creation failed
> [   36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
> [   36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145
> 
> Second attempt dmesg (success):
> --------------------------------
> [   40.440012] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 36000/36000 Mb/s
> [   40.440261] thunderbolt 0000:c7:00.6: 2:16: maximum required bandwidth for USB3 tunnel 9000 Mb/s
> [   40.440269] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): activating
> [   40.440271] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): allocating initial bandwidth 9000/9000 Mb/s
> 
> The bandwidth difference (9000 vs 36000 Mb/s) indicates the first attempt
> occurs while the link is still in single-lane mode.
> 
> Root Cause Analysis:
> ====================
> The error originates from the Thunderbolt/USB4 device hardware itself:
> 
> 1. Port config space read/write returns TB_CFG_ERROR_PORT_NOT_CONNECTED
> 2. This gets translated to -ENOTCONN in tb_cfg_get_error()
> 3. The port's control channel is temporarily unavailable during state
>    transition from single lane to dual lane (lane bonding)
> 
> The comment in drivers/thunderbolt/ctl.c explains this is expected:
>   "Port is not connected. This can happen during surprise removal.
>    Do not warn."
> 
> Attempted Solutions:
> ====================
> 1. Retry logic on -ENOTCONN in tb_path_activate():
>    Result: Caused host port (0:0) lockup with hundreds of "downstream
>    port is locked" errors. Rejected by user.
> 
> 2. Increased tb_port_wait_for_link_width() timeout from 100ms to 3000ms:
>    Result: Did not resolve the issue. The timeout increase alone is
>    insufficient because the port state hasn't reached TB_PORT_UP when
>    lane bonding is attempted.
> 
> 3. Added msleep(2000) at various points in enumeration flow:
>    Locations tested:
>    - Before tb_switch_configure(): Works ✓
>    - Before tb_switch_add(): Works ✓
>    - Before usb4_port_hotplug_enable(): Works ✓
>    - After tb_switch_add(): Doesn't work ✗
>    - In tb_configure_link(): Doesn't work ✗
>    - In tb_switch_lane_bonding_enable(): Doesn't work ✗
>    - In tb_port_wait_for_link_width(): Doesn't work ✗
> 
>    The pattern shows the delay must occur BEFORE hotplug enable, which
>    happens early in tb_switch_port_hotplug_enable() -> usb4_port_hotplug_enable().
> 
> Current Workaround:
> ===================
> Add a 2-second delay in tb_wait_for_port() when the port state reaches
> TB_PORT_UP. This is the earliest point where we know:
> - The link is physically established
> - The device is responsive
> - But lane width negotiation may still be in progress
> 
> This location is chosen because:
> 1. It's called during port enumeration before any tunnel creation
> 2. The port has just transitioned to TB_PORT_UP state
> 3. Allows sufficient time for lane bonding to complete
> 4. Avoids affecting other code paths
> 
> Testing Results:
> ================
> With this patch:
> - No call trace on first plug
> - Device enumerates correctly on first attempt
> - Full bandwidth (36000 Mb/s) available immediately
> - No disconnect/reconnect cycle
> - USB and PCIe tunnels create successfully
> 
> Without this patch:
> - Call trace on every first plug
> - Only 9000 Mb/s bandwidth (single lane) on first attempt
> - Automatic disconnect/reconnect after ~3 seconds
> - Second attempt works with 36000 Mb/s
> 
> Discussion Points for RFC:
> ===========================
> 1. Is a fixed 2-second delay acceptable, or should we poll for a
>    specific hardware state?
> 
> 2. Should we check PORT_CS_18_TIP (Transition In Progress) bit instead
>    of using a fixed delay?
> 
> 3. Is there a better location for this delay in the enumeration flow?
> 
> 4. Should this be device-specific (based on vendor/device ID) or apply
>    to all USB4 devices?
> 
> 5. The 100ms timeout in tb_switch_lane_bonding_enable() may be too
>    short for other devices as well. Should we increase it universally?

We should understand the issue better. This is Intel Goshen Ridge based
monitor which I'm pretty sure does not require additional quirks, at least
I have not heard any issues like this. I suspect this is combination of the
AMD and Intel hardware that is causing the issue.

Looking at your dmesg, even before your issue there is suspicious log
entry:

[    5.852476] localhost kernel: [31] thunderbolt 0000:c7:00.5: acking hot unplug event on 0:6
[    5.852492] localhost kernel: [12] thunderbolt 0000:c7:00.5: 0:6: DP IN resource unavailable: adapter unplug

This causes tearing down the DP tunnel. It is unexpected for the host
router to send this unless you plugged monitor directly to some of the
Type-C ports at this time?

I wonder if you could take trace logs too from the issue? Instructions:

https://github.com/intel/tbtools?tab=readme-ov-file#tracing
https://github.com/intel/tbtools/wiki/Useful-Commands#tracing

Please provide both full dmesg and the trace.out or the merged one. That
would allow us to look what is going on (hopefully).

> Hardware Details:
> =================
> Device: Dell U2725QE Thunderbolt Monitor
> USB4 Router: 8087:b26 (Intel USB4 controller)
> Host: AMD Thunderbolt 4 controller (0000:c7:00.6)
> 
> Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
> ---
> Full dmesg log available at: https://paste.ubuntu.com/p/CXs2T4XzZ3/
> ---
>  drivers/thunderbolt/switch.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
> index b3948aad0b955..e0c65e5fb0dca 100644
> --- a/drivers/thunderbolt/switch.c
> +++ b/drivers/thunderbolt/switch.c
> @@ -530,6 +530,8 @@ int tb_wait_for_port(struct tb_port *port, bool wait_if_unplugged)
>  			return 0;
>  
>  		case TB_PORT_UP:
> +			msleep(2000);
> +			fallthrough;
>  		case TB_PORT_TX_CL0S:
>  		case TB_PORT_RX_CL0S:
>  		case TB_PORT_CL1:
> -- 
> 2.43.0

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-09  7:06 ` Mika Westerberg
@ 2025-12-09 16:49   ` Mario Limonciello
  2025-12-10  5:33     ` Chia-Lin Kao (AceLan)
  2025-12-10  3:15   ` Chia-Lin Kao (AceLan)
  1 sibling, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-12-09 16:49 UTC (permalink / raw)
  To: Mika Westerberg, Chia-Lin Kao (AceLan), Sanath.S
  Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, linux-usb,
	linux-kernel

+Sanath too

On 12/9/2025 1:06 AM, Mika Westerberg wrote:
> +Mario since this is AMD related.
> 
> [Also keeping all the context].
> 

Thanks for adding me.  A few other thoughts I have:

1) Is it possible that the USB4 controller in the monitor is powering up 
or exiting a low power state during the first hotplug?

2) Are you sure this only happens on AMD host?  What if you cold boot 
the monitor with Intel host?


> On Tue, Dec 09, 2025 at 01:41:41PM +0800, Chia-Lin Kao (AceLan) wrote:
>> When plugging in a Dell U2725QE Thunderbolt monitor, the kernel produces
>> a call trace during initial enumeration. The device automatically
>> disconnects and reconnects ~3 seconds later, and works correctly on the
>> second attempt.
>>
>> Issue Description:
>> ==================
>> The Dell U2725QE (USB4 device 8087:b26) requires additional time during
>> link width negotiation from single lane to dual lane. On first plug, the
>> following sequence occurs:
>>
>> 1. Port state reaches TB_PORT_UP (link established, single lane)
>> 2. Path activation begins immediately
>> 3. tb_path_activate() -> tb_port_write() returns -ENOTCONN (error -107)
>> 4. Call trace is generated at tb_path_activate()
>> 5. Device disconnects/reconnects automatically after ~3 seconds
>> 6. Second attempt succeeds with full dual-lane bandwidth
>>
>> First attempt dmesg (failure):
>> -------------------------------
>> [   36.030347] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 9000/9000 Mb/s
>> [   36.030613] thunderbolt 0000:c7:00.6: 2: USB3 tunnel creation failed
>> [   36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
>> [   36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145
>>
>> Second attempt dmesg (success):
>> --------------------------------
>> [   40.440012] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 36000/36000 Mb/s
>> [   40.440261] thunderbolt 0000:c7:00.6: 2:16: maximum required bandwidth for USB3 tunnel 9000 Mb/s
>> [   40.440269] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): activating
>> [   40.440271] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): allocating initial bandwidth 9000/9000 Mb/s
>>
>> The bandwidth difference (9000 vs 36000 Mb/s) indicates the first attempt
>> occurs while the link is still in single-lane mode.
>>
>> Root Cause Analysis:
>> ====================
>> The error originates from the Thunderbolt/USB4 device hardware itself:
>>
>> 1. Port config space read/write returns TB_CFG_ERROR_PORT_NOT_CONNECTED
>> 2. This gets translated to -ENOTCONN in tb_cfg_get_error()
>> 3. The port's control channel is temporarily unavailable during state
>>     transition from single lane to dual lane (lane bonding)
>>
>> The comment in drivers/thunderbolt/ctl.c explains this is expected:
>>    "Port is not connected. This can happen during surprise removal.
>>     Do not warn."
>>
>> Attempted Solutions:
>> ====================
>> 1. Retry logic on -ENOTCONN in tb_path_activate():
>>     Result: Caused host port (0:0) lockup with hundreds of "downstream
>>     port is locked" errors. Rejected by user.
>>
>> 2. Increased tb_port_wait_for_link_width() timeout from 100ms to 3000ms:
>>     Result: Did not resolve the issue. The timeout increase alone is
>>     insufficient because the port state hasn't reached TB_PORT_UP when
>>     lane bonding is attempted.
>>
>> 3. Added msleep(2000) at various points in enumeration flow:
>>     Locations tested:
>>     - Before tb_switch_configure(): Works ✓
>>     - Before tb_switch_add(): Works ✓
>>     - Before usb4_port_hotplug_enable(): Works ✓
>>     - After tb_switch_add(): Doesn't work ✗
>>     - In tb_configure_link(): Doesn't work ✗
>>     - In tb_switch_lane_bonding_enable(): Doesn't work ✗
>>     - In tb_port_wait_for_link_width(): Doesn't work ✗
>>
>>     The pattern shows the delay must occur BEFORE hotplug enable, which
>>     happens early in tb_switch_port_hotplug_enable() -> usb4_port_hotplug_enable().
>>
>> Current Workaround:
>> ===================
>> Add a 2-second delay in tb_wait_for_port() when the port state reaches
>> TB_PORT_UP. This is the earliest point where we know:
>> - The link is physically established
>> - The device is responsive
>> - But lane width negotiation may still be in progress
>>
>> This location is chosen because:
>> 1. It's called during port enumeration before any tunnel creation
>> 2. The port has just transitioned to TB_PORT_UP state
>> 3. Allows sufficient time for lane bonding to complete
>> 4. Avoids affecting other code paths
>>
>> Testing Results:
>> ================
>> With this patch:
>> - No call trace on first plug
>> - Device enumerates correctly on first attempt
>> - Full bandwidth (36000 Mb/s) available immediately
>> - No disconnect/reconnect cycle
>> - USB and PCIe tunnels create successfully
>>
>> Without this patch:
>> - Call trace on every first plug
>> - Only 9000 Mb/s bandwidth (single lane) on first attempt
>> - Automatic disconnect/reconnect after ~3 seconds
>> - Second attempt works with 36000 Mb/s
>>
>> Discussion Points for RFC:
>> ===========================
>> 1. Is a fixed 2-second delay acceptable, or should we poll for a
>>     specific hardware state?
>>
>> 2. Should we check PORT_CS_18_TIP (Transition In Progress) bit instead
>>     of using a fixed delay?
>>
>> 3. Is there a better location for this delay in the enumeration flow?
>>
>> 4. Should this be device-specific (based on vendor/device ID) or apply
>>     to all USB4 devices?
>>
>> 5. The 100ms timeout in tb_switch_lane_bonding_enable() may be too
>>     short for other devices as well. Should we increase it universally?
> 
> We should understand the issue better. This is Intel Goshen Ridge based
> monitor which I'm pretty sure does not require additional quirks, at least
> I have not heard any issues like this. I suspect this is combination of the
> AMD and Intel hardware that is causing the issue.
> 
> Looking at your dmesg, even before your issue there is suspicious log
> entry:
> 
> [    5.852476] localhost kernel: [31] thunderbolt 0000:c7:00.5: acking hot unplug event on 0:6
> [    5.852492] localhost kernel: [12] thunderbolt 0000:c7:00.5: 0:6: DP IN resource unavailable: adapter unplug
> 
> This causes tearing down the DP tunnel. It is unexpected for the host
> router to send this unless you plugged monitor directly to some of the
> Type-C ports at this time?
> 
> I wonder if you could take trace logs too from the issue? Instructions:
> 
> https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> https://github.com/intel/tbtools/wiki/Useful-Commands#tracing
> 
> Please provide both full dmesg and the trace.out or the merged one. That
> would allow us to look what is going on (hopefully).

We need to be careful trusting the LLM conclusions.

Hopefully the traces requested by Mika show what's going on.

If they don't, then I think the next step will be a USB4 analyzer.

> 
>> Hardware Details:
>> =================
>> Device: Dell U2725QE Thunderbolt Monitor
>> USB4 Router: 8087:b26 (Intel USB4 controller)
>> Host: AMD Thunderbolt 4 controller (0000:c7:00.6)

What sort of hardware is the AMD host?  PCI BDF is meaningless.

>>
>> Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
>> ---
>> Full dmesg log available at: https://paste.ubuntu.com/p/CXs2T4XzZ3/
>> ---
>>   drivers/thunderbolt/switch.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
>> index b3948aad0b955..e0c65e5fb0dca 100644
>> --- a/drivers/thunderbolt/switch.c
>> +++ b/drivers/thunderbolt/switch.c
>> @@ -530,6 +530,8 @@ int tb_wait_for_port(struct tb_port *port, bool wait_if_unplugged)
>>   			return 0;
>>   
>>   		case TB_PORT_UP:
>> +			msleep(2000);
>> +			fallthrough;
>>   		case TB_PORT_TX_CL0S:
>>   		case TB_PORT_RX_CL0S:
>>   		case TB_PORT_CL1:
>> -- 
>> 2.43.0


* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-09  7:06 ` Mika Westerberg
  2025-12-09 16:49   ` Mario Limonciello
@ 2025-12-10  3:15   ` Chia-Lin Kao (AceLan)
  2025-12-10  7:41     ` Mika Westerberg
  1 sibling, 1 reply; 21+ messages in thread
From: Chia-Lin Kao (AceLan) @ 2025-12-10  3:15 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, linux-usb,
	linux-kernel, Mario Limonciello

Hi Mika,

On Tue, Dec 09, 2025 at 08:06:33AM +0100, Mika Westerberg wrote:
> +Mario since this is AMD related.
> 
> [Also keeping all the context].
> 
> On Tue, Dec 09, 2025 at 01:41:41PM +0800, Chia-Lin Kao (AceLan) wrote:
> > When plugging in a Dell U2725QE Thunderbolt monitor, the kernel produces
> > a call trace during initial enumeration. The device automatically
> > disconnects and reconnects ~3 seconds later, and works correctly on the
> > second attempt.
> > 
> > Issue Description:
> > ==================
> > The Dell U2725QE (USB4 device 8087:b26) requires additional time during
> > link width negotiation from single lane to dual lane. On first plug, the
> > following sequence occurs:
> > 
> > 1. Port state reaches TB_PORT_UP (link established, single lane)
> > 2. Path activation begins immediately
> > 3. tb_path_activate() -> tb_port_write() returns -ENOTCONN (error -107)
> > 4. Call trace is generated at tb_path_activate()
> > 5. Device disconnects/reconnects automatically after ~3 seconds
> > 6. Second attempt succeeds with full dual-lane bandwidth
> > 
> > First attempt dmesg (failure):
> > -------------------------------
> > [   36.030347] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 9000/9000 Mb/s
> > [   36.030613] thunderbolt 0000:c7:00.6: 2: USB3 tunnel creation failed
> > [   36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
> > [   36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145
> > 
> > Second attempt dmesg (success):
> > --------------------------------
> > [   40.440012] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 36000/36000 Mb/s
> > [   40.440261] thunderbolt 0000:c7:00.6: 2:16: maximum required bandwidth for USB3 tunnel 9000 Mb/s
> > [   40.440269] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): activating
> > [   40.440271] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): allocating initial bandwidth 9000/9000 Mb/s
> > 
> > The bandwidth difference (9000 vs 36000 Mb/s) indicates the first attempt
> > occurs while the link is still in single-lane mode.
> > 
> > Root Cause Analysis:
> > ====================
> > The error originates from the Thunderbolt/USB4 device hardware itself:
> > 
> > 1. Port config space read/write returns TB_CFG_ERROR_PORT_NOT_CONNECTED
> > 2. This gets translated to -ENOTCONN in tb_cfg_get_error()
> > 3. The port's control channel is temporarily unavailable during state
> >    transition from single lane to dual lane (lane bonding)
> > 
> > The comment in drivers/thunderbolt/ctl.c explains this is expected:
> >   "Port is not connected. This can happen during surprise removal.
> >    Do not warn."
> > 
> > Attempted Solutions:
> > ====================
> > 1. Retry logic on -ENOTCONN in tb_path_activate():
> >    Result: Caused host port (0:0) lockup with hundreds of "downstream
> >    port is locked" errors. Rejected by user.
> > 
> > 2. Increased tb_port_wait_for_link_width() timeout from 100ms to 3000ms:
> >    Result: Did not resolve the issue. The timeout increase alone is
> >    insufficient because the port state hasn't reached TB_PORT_UP when
> >    lane bonding is attempted.
> > 
> > 3. Added msleep(2000) at various points in enumeration flow:
> >    Locations tested:
> >    - Before tb_switch_configure(): Works ✓
> >    - Before tb_switch_add(): Works ✓
> >    - Before usb4_port_hotplug_enable(): Works ✓
> >    - After tb_switch_add(): Doesn't work ✗
> >    - In tb_configure_link(): Doesn't work ✗
> >    - In tb_switch_lane_bonding_enable(): Doesn't work ✗
> >    - In tb_port_wait_for_link_width(): Doesn't work ✗
> > 
> >    The pattern shows the delay must occur BEFORE hotplug enable, which
> >    happens early in tb_switch_port_hotplug_enable() -> usb4_port_hotplug_enable().
> > 
> > Current Workaround:
> > ===================
> > Add a 2-second delay in tb_wait_for_port() when the port state reaches
> > TB_PORT_UP. This is the earliest point where we know:
> > - The link is physically established
> > - The device is responsive
> > - But lane width negotiation may still be in progress
> > 
> > This location is chosen because:
> > 1. It's called during port enumeration before any tunnel creation
> > 2. The port has just transitioned to TB_PORT_UP state
> > 3. Allows sufficient time for lane bonding to complete
> > 4. Avoids affecting other code paths
> > 
> > Testing Results:
> > ================
> > With this patch:
> > - No call trace on first plug
> > - Device enumerates correctly on first attempt
> > - Full bandwidth (36000 Mb/s) available immediately
> > - No disconnect/reconnect cycle
> > - USB and PCIe tunnels create successfully
> > 
> > Without this patch:
> > - Call trace on every first plug
> > - Only 9000 Mb/s bandwidth (single lane) on first attempt
> > - Automatic disconnect/reconnect after ~3 seconds
> > - Second attempt works with 36000 Mb/s
> > 
> > Discussion Points for RFC:
> > ===========================
> > 1. Is a fixed 2-second delay acceptable, or should we poll for a
> >    specific hardware state?
> > 
> > 2. Should we check PORT_CS_18_TIP (Transition In Progress) bit instead
> >    of using a fixed delay?
> > 
> > 3. Is there a better location for this delay in the enumeration flow?
> > 
> > 4. Should this be device-specific (based on vendor/device ID) or apply
> >    to all USB4 devices?
> > 
> > 5. The 100ms timeout in tb_switch_lane_bonding_enable() may be too
> >    short for other devices as well. Should we increase it universally?
> 
> We should understand the issue better. This is Intel Goshen Ridge based
> monitor which I'm pretty sure does not require additional quirks, at least
> I have not heard any issues like this. I suspect this is combination of the
> AMD and Intel hardware that is causing the issue.
Actually, we encountered the same issue on an Intel machine, too.
Here is a log captured by my ex-colleague. At that time he used the
6.16-rc4 drmtip kernel, and he should have reported this issue somewhere.
https://paste.ubuntu.com/p/bJkBTdYMp6/

The log is combined with the drm debug log and was too large to
paste on the pastebin, so I removed some unrelated lines between 44s
and 335s.

> 
> Looking at your dmesg, even before your issue there is suspicious log
> entry:
> 
> [    5.852476] localhost kernel: [31] thunderbolt 0000:c7:00.5: acking hot unplug event on 0:6
> [    5.852492] localhost kernel: [12] thunderbolt 0000:c7:00.5: 0:6: DP IN resource unavailable: adapter unplug
> 
> This causes tearing down the DP tunnel. It is unexpected for the host
> router to send this unless you plugged monitor directly to some of the
> Type-C ports at this time?
No, I didn't plug in any device during boot.

> 
> I wonder if you could take trace logs too from the issue? Instructions:
> 
> https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> https://github.com/intel/tbtools/wiki/Useful-Commands#tracing
> 
> Please provide both full dmesg and the trace.out or the merged one. That
> would allow us to look what is going on (hopefully).
Got it, I'll do it tomorrow.

> 
> > Hardware Details:
> > =================
> > Device: Dell U2725QE Thunderbolt Monitor
> > USB4 Router: 8087:b26 (Intel USB4 controller)
> > Host: AMD Thunderbolt 4 controller (0000:c7:00.6)
> > 
> > Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
> > ---
> > Full dmesg log available at: https://paste.ubuntu.com/p/CXs2T4XzZ3/
> > ---
> >  drivers/thunderbolt/switch.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
> > index b3948aad0b955..e0c65e5fb0dca 100644
> > --- a/drivers/thunderbolt/switch.c
> > +++ b/drivers/thunderbolt/switch.c
> > @@ -530,6 +530,8 @@ int tb_wait_for_port(struct tb_port *port, bool wait_if_unplugged)
> >  			return 0;
> >  
> >  		case TB_PORT_UP:
> > +			msleep(2000);
> > +			fallthrough;
> >  		case TB_PORT_TX_CL0S:
> >  		case TB_PORT_RX_CL0S:
> >  		case TB_PORT_CL1:
> > -- 
> > 2.43.0

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-09 16:49   ` Mario Limonciello
@ 2025-12-10  5:33     ` Chia-Lin Kao (AceLan)
  0 siblings, 0 replies; 21+ messages in thread
From: Chia-Lin Kao (AceLan) @ 2025-12-10  5:33 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Mika Westerberg, Sanath.S, Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel

On Tue, Dec 09, 2025 at 10:49:46AM -0600, Mario Limonciello wrote:
> +Sanath too
> 
> On 12/9/2025 1:06 AM, Mika Westerberg wrote:
> > +Mario since this is AMD related.
> > 
> > [Also keeping all the context].
> > 
> 
> Thanks for adding me.  A few other thoughts I have:
> 
> 1) Is it possible that the USB4 controller in the monitor is powering up or
> exiting a low power state during the first hotplug?
That's a good point. We currently only have one TB4 monitor.
I'll check with our QA to see if we have other TB4 devices to try.

The issue can't be reproduced with a TB3 monitor or TB3 storage.
> 
> 2) Are you sure this only happens on AMD host?  What if you cold boot the
> monitor with Intel host?
The issue can be reproduced with an Intel host, too.
I addressed this in my reply to Mika's email.

> 
> 
> > On Tue, Dec 09, 2025 at 01:41:41PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > When plugging in a Dell U2725QE Thunderbolt monitor, the kernel produces
> > > a call trace during initial enumeration. The device automatically
> > > disconnects and reconnects ~3 seconds later, and works correctly on the
> > > second attempt.
> > > 
> > > Issue Description:
> > > ==================
> > > The Dell U2725QE (USB4 device 8087:b26) requires additional time during
> > > link width negotiation from single lane to dual lane. On first plug, the
> > > following sequence occurs:
> > > 
> > > 1. Port state reaches TB_PORT_UP (link established, single lane)
> > > 2. Path activation begins immediately
> > > 3. tb_path_activate() - > tb_port_write() returns -ENOTCONN (error -107)
> > > 4. Call trace is generated at tb_path_activate()
> > > 5. Device disconnects/reconnects automatically after ~3 seconds
> > > 6. Second attempt succeeds with full dual-lane bandwidth
> > > 
> > > First attempt dmesg (failure):
> > > -------------------------------
> > > [   36.030347] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 9000/9000 Mb/s
> > > [   36.030613] thunderbolt 0000:c7:00.6: 2: USB3 tunnel creation failed
> > > [   36.031530] thunderbolt 0000:c7:00.6: PCIe Down path activation failed
> > > [   36.031531] WARNING: drivers/thunderbolt/path.c:589 at 0x0, CPU#12: pool-/usr/libex/3145
> > > 
> > > Second attempt dmesg (success):
> > > --------------------------------
> > > [   40.440012] thunderbolt 0000:c7:00.6: 2:16: available bandwidth for new USB3 tunnel 36000/36000 Mb/s
> > > [   40.440261] thunderbolt 0000:c7:00.6: 2:16: maximum required bandwidth for USB3 tunnel 9000 Mb/s
> > > [   40.440269] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): activating
> > > [   40.440271] thunderbolt 0000:c7:00.6: 0:4 <-> 2:16 (USB3): allocating initial bandwidth 9000/9000 Mb/s
> > > 
> > > The bandwidth difference (9000 vs 36000 Mb/s) indicates the first attempt
> > > occurs while the link is still in single-lane mode.
> > > 
> > > Root Cause Analysis:
> > > ====================
> > > The error originates from the Thunderbolt/USB4 device hardware itself:
> > > 
> > > 1. Port config space read/write returns TB_CFG_ERROR_PORT_NOT_CONNECTED
> > > 2. This gets translated to -ENOTCONN in tb_cfg_get_error()
> > > 3. The port's control channel is temporarily unavailable during state
> > >     transition from single lane to dual lane (lane bonding)
> > > 
> > > The comment in drivers/thunderbolt/ctl.c explains this is expected:
> > >    "Port is not connected. This can happen during surprise removal.
> > >     Do not warn."
> > > 
> > > Attempted Solutions:
> > > ====================
> > > 1. Retry logic on -ENOTCONN in tb_path_activate():
> > >     Result: Caused host port (0:0) lockup with hundreds of "downstream
> > >     port is locked" errors. Rejected by user.
> > > 
> > > 2. Increased tb_port_wait_for_link_width() timeout from 100ms to 3000ms:
> > >     Result: Did not resolve the issue. The timeout increase alone is
> > >     insufficient because the port state hasn't reached TB_PORT_UP when
> > >     lane bonding is attempted.
> > > 
> > > 3. Added msleep(2000) at various points in enumeration flow:
> > >     Locations tested:
> > >     - Before tb_switch_configure(): Works ✓
> > >     - Before tb_switch_add(): Works ✓
> > >     - Before usb4_port_hotplug_enable(): Works ✓
> > >     - After tb_switch_add(): Doesn't work ✗
> > >     - In tb_configure_link(): Doesn't work ✗
> > >     - In tb_switch_lane_bonding_enable(): Doesn't work ✗
> > >     - In tb_port_wait_for_link_width(): Doesn't work ✗
> > > 
> > >     The pattern shows the delay must occur BEFORE hotplug enable, which
> > >     happens early in tb_switch_port_hotplug_enable() -> usb4_port_hotplug_enable().
> > > 
> > > Current Workaround:
> > > ===================
> > > Add a 2-second delay in tb_wait_for_port() when the port state reaches
> > > TB_PORT_UP. This is the earliest point where we know:
> > > - The link is physically established
> > > - The device is responsive
> > > - But lane width negotiation may still be in progress
> > > 
> > > This location is chosen because:
> > > 1. It's called during port enumeration before any tunnel creation
> > > 2. The port has just transitioned to TB_PORT_UP state
> > > 3. Allows sufficient time for lane bonding to complete
> > > 4. Avoids affecting other code paths
> > > 
> > > Testing Results:
> > > ================
> > > With this patch:
> > > - No call trace on first plug
> > > - Device enumerates correctly on first attempt
> > > - Full bandwidth (36000 Mb/s) available immediately
> > > - No disconnect/reconnect cycle
> > > - USB and PCIe tunnels create successfully
> > > 
> > > Without this patch:
> > > - Call trace on every first plug
> > > - Only 9000 Mb/s bandwidth (single lane) on first attempt
> > > - Automatic disconnect/reconnect after ~3 seconds
> > > - Second attempt works with 36000 Mb/s
> > > 
> > > Discussion Points for RFC:
> > > ===========================
> > > 1. Is a fixed 2-second delay acceptable, or should we poll for a
> > >     specific hardware state?
> > > 
> > > 2. Should we check PORT_CS_18_TIP (Transition In Progress) bit instead
> > >     of using a fixed delay?
> > > 
> > > 3. Is there a better location for this delay in the enumeration flow?
> > > 
> > > 4. Should this be device-specific (based on vendor/device ID) or apply
> > >     to all USB4 devices?
> > > 
> > > 5. The 100ms timeout in tb_switch_lane_bonding_enable() may be too
> > >     short for other devices as well. Should we increase it universally?
> > 
> > We should understand the issue better. This is Intel Goshen Ridge based
> > monitor which I'm pretty sure does not require additional quirks, at least
> > I have not heard any issues like this. I suspect this is combination of the
> > AMD and Intel hardware that is causing the issue.
> > 
> > Looking at your dmesg, even before your issue there is suspicious log
> > entry:
> > 
> > [    5.852476] localhost kernel: [31] thunderbolt 0000:c7:00.5: acking hot unplug event on 0:6
> > [    5.852492] localhost kernel: [12] thunderbolt 0000:c7:00.5: 0:6: DP IN resource unavailable: adapter unplug
> > 
> > This causes tearing down the DP tunnel. It is unexpected for the host
> > router to send this unless you plugged monitor directly to some of the
> > Type-C ports at this time?
> > 
> > I wonder if you could take trace logs too from the issue? Instructions:
> > 
> > https://github.com/intel/tbtools?tab=readme-ov-file#tracing
> > https://github.com/intel/tbtools/wiki/Useful-Commands#tracing
> > 
> > Please provide both full dmesg and the trace.out or the merged one. That
> > would allow us to look what is going on (hopefully).
> 
> We need to be careful trusting the LLM conclusions.
> 
> Hopefully the traces requested by Mika show what's going on.
> 
> If they don't, then I think the next step will be a USB4 analyzer.
> 
> > 
> > > Hardware Details:
> > > =================
> > > Device: Dell U2725QE Thunderbolt Monitor
> > > USB4 Router: 8087:b26 (Intel USB4 controller)
> > > Host: AMD Thunderbolt 4 controller (0000:c7:00.6)
> 
> What sort of hardware is the AMD host?  PCI BDF is meaningless.
It's an ongoing Dell project.

ubuntu@localhost:~$ lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      24
  On-line CPU(s) list:       0-23
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen AI 9 HX PRO 370 w/ Radeon 890M
    CPU family:              26
    Model:                   36
    Thread(s) per core:      2
    Core(s) per socket:      12
    Socket(s):               1
    Stepping:                0
    Frequency boost:         enabled
    CPU(s) scaling MHz:      24%
    CPU max MHz:             5157.8950
    CPU min MHz:             605.2640
    BogoMIPS:                3992.19
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx
                             mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid
                             aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy
                             svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb
                             bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced
                             vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq adx smap avx512ifma
                             clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
                             cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save
                             tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl
                             vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
                             bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
Virtualization features:     
  Virtualization:            AMD-V
Caches (sum of all):         
  L1d:                       576 KiB (12 instances)
  L1i:                       384 KiB (12 instances)
  L2:                        12 MiB (12 instances)
  L3:                        24 MiB (2 instances)
NUMA:                        
  NUMA node(s):              1
  NUMA node0 CPU(s):         0-23
Vulnerabilities:             
  Gather data sampling:      Not affected
  Ghostwrite:                Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Old microcode:             Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Not affected
  Spec rstack overflow:      Mitigation; IBPB on VMEXIT only
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Not affected
  Vmscape:                   Mitigation; IBPB on VMEXIT

ubuntu@localhost:~$ sudo lspci -vvnns c7:00.6
c7:00.6 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Device [1022:151d] (prog-if 40 [USB4 Host Interface])
        Subsystem: Advanced Micro Devices, Inc. [AMD] Device [1022:151d]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 47
        IOMMU group: 32
        Region 0: Memory at b0c80000 (64-bit, non-prefetchable) [size=512K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [64] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x16
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable- Count=1/16 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=0 offset=0007e000
                PBA: BAR=0 offset=0007f000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: thunderbolt
        Kernel modules: thunderbolt

> 
> > > 
> > > Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
> > > ---
> > > Full dmesg log available at: https://paste.ubuntu.com/p/CXs2T4XzZ3/
> > > ---
> > >   drivers/thunderbolt/switch.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
> > > index b3948aad0b955..e0c65e5fb0dca 100644
> > > --- a/drivers/thunderbolt/switch.c
> > > +++ b/drivers/thunderbolt/switch.c
> > > @@ -530,6 +530,8 @@ int tb_wait_for_port(struct tb_port *port, bool wait_if_unplugged)
> > >   			return 0;
> > >   		case TB_PORT_UP:
> > > +			msleep(2000);
> > > +			fallthrough;
> > >   		case TB_PORT_TX_CL0S:
> > >   		case TB_PORT_RX_CL0S:
> > >   		case TB_PORT_CL1:
> > > -- 
> > > 2.43.0
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-10  3:15   ` Chia-Lin Kao (AceLan)
@ 2025-12-10  7:41     ` Mika Westerberg
  2025-12-10 21:42       ` Mario Limonciello
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-10  7:41 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan), Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Mario Limonciello,
	Sanath.S

Hi,

On Wed, Dec 10, 2025 at 11:15:25AM +0800, Chia-Lin Kao (AceLan) wrote:
> > We should understand the issue better. This is Intel Goshen Ridge based
> > monitor which I'm pretty sure does not require additional quirks, at least
> > I have not heard any issues like this. I suspect this is combination of the
> > AMD and Intel hardware that is causing the issue.
> Actually, we encountered the same issue on Intel machine, too.
> Here is the log captured by my ex-colleague, and at that time he used
> 6.16-rc4 drmtip kernel and should have reported this issue somewhere.
> https://paste.ubuntu.com/p/bJkBTdYMp6/
> 
> The log combines with drm debug log, and becomes too large to be
> pasted on the pastebin, so I removed some unrelated lines between 44s
> ~ 335s.

Okay I see similar unplug there:

[  337.429646] [374] thunderbolt:tb_handle_dp_bandwidth_request:2752: thunderbolt 0000:00:0d.2: 0:5: handling bandwidth allocation request, retry 0
...
[  337.430291] [165] thunderbolt:tb_cfg_ack_plug:842: thunderbolt 0000:00:0d.2: acking hot unplug event on 0:1

We had an issue with MST monitors, but that resulted in an unplug of the
DP OUT, not in the link going down. That was fixed with:

  9cb15478916e ("drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read")

If you still have the Intel hardware there, it would be good if you could
provide a trace from that as well.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-10  7:41     ` Mika Westerberg
@ 2025-12-10 21:42       ` Mario Limonciello
       [not found]         ` <coxrm5gishdztghznuvzafg2pbdk4qk3ttbkbq7t5whsfv2lk5@3gqepcs6h4uc>
  0 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-12-10 21:42 UTC (permalink / raw)
  To: Mika Westerberg, Chia-Lin Kao (AceLan), Andreas Noever,
	Mika Westerberg, Yehezkel Bernat, linux-usb, linux-kernel,
	Sanath.S, Lin, Wayne

+Wayne

Here is the full thread since you're being added in late.

https://lore.kernel.org/linux-usb/20251209054141.1975982-1-acelan.kao@canonical.com/

On 12/10/25 1:41 AM, Mika Westerberg wrote:
> Hi,
> 
> On Wed, Dec 10, 2025 at 11:15:25AM +0800, Chia-Lin Kao (AceLan) wrote:
>>> We should understand the issue better. This is Intel Goshen Ridge based
>>> monitor which I'm pretty sure does not require additional quirks, at least
>>> I have not heard any issues like this. I suspect this is combination of the
>>> AMD and Intel hardware that is causing the issue.
>> Actually, we encountered the same issue on Intel machine, too.
>> Here is the log captured by my ex-colleague, and at that time he used
>> 6.16-rc4 drmtip kernel and should have reported this issue somewhere.
>> https://paste.ubuntu.com/p/bJkBTdYMp6/
>>
>> The log combines with drm debug log, and becomes too large to be
>> pasted on the pastebin, so I removed some unrelated lines between 44s
>> ~ 335s.
> 
> Okay I see similar unplug there:
> 
> [  337.429646] [374] thunderbolt:tb_handle_dp_bandwidth_request:2752: thunderbolt 0000:00:0d.2: 0:5: handling bandwidth allocation request, retry 0
> ...
> [  337.430291] [165] thunderbolt:tb_cfg_ack_plug:842: thunderbolt 0000:00:0d.2: acking hot unplug event on 0:1
> 
> We had an issue with MST monitors but that resulted unplug of the DP OUT
> not link going down. That was fixed with:
> 
>    9cb15478916e ("drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read")
> 
> If you have Intel hardware still there it would be good if you could try
> and provide trace from that as well.

If that does help, we could experiment with doing something similar in
amdgpu too.

It would mean it's not really an iTBT DP-in adapter's firmware issue in
that case.

Acelan,

If you want to try porting 9cb15478916e over, I expect the interrupt
handler in amdgpu that needs the change to be dp_read_hpd_rx_irq_data().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
       [not found]         ` <coxrm5gishdztghznuvzafg2pbdk4qk3ttbkbq7t5whsfv2lk5@3gqepcs6h4uc>
@ 2025-12-12 12:39           ` Mika Westerberg
  2025-12-12 14:40             ` Mario Limonciello
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-12 12:39 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan), Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Sanath.S, Lin, Wayne,
	Mario Limonciello

Hi,

On Fri, Dec 12, 2025 at 12:10:24PM +0800, Chia-Lin Kao (AceLan) wrote:
> Hi Mika,
> 
> On Wed, Dec 10, 2025 at 03:42:21PM -0600, Mario Limonciello wrote:
> > +Wayne
> > 
> > Here is the full thread since you're being added in late.
> > 
> > https://lore.kernel.org/linux-usb/20251209054141.1975982-1-acelan.kao@canonical.com/
> > 
> > On 12/10/25 1:41 AM, Mika Westerberg wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 10, 2025 at 11:15:25AM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > > We should understand the issue better. This is Intel Goshen Ridge based
> > > > > monitor which I'm pretty sure does not require additional quirks, at least
> > > > > I have not heard any issues like this. I suspect this is combination of the
> > > > > AMD and Intel hardware that is causing the issue.
> > > > Actually, we encountered the same issue on Intel machine, too.
> > > > Here is the log captured by my ex-colleague, and at that time he used
> > > > 6.16-rc4 drmtip kernel and should have reported this issue somewhere.
> > > > https://paste.ubuntu.com/p/bJkBTdYMp6/
> > > > 
> > > > The log combines with drm debug log, and becomes too large to be
> > > > pasted on the pastebin, so I removed some unrelated lines between 44s
> > > > ~ 335s.
> > > 
> > > Okay I see similar unplug there:
> > > 
> > > [  337.429646] [374] thunderbolt:tb_handle_dp_bandwidth_request:2752: thunderbolt 0000:00:0d.2: 0:5: handling bandwidth allocation request, retry 0
> > > ...
> > > [  337.430291] [165] thunderbolt:tb_cfg_ack_plug:842: thunderbolt 0000:00:0d.2: acking hot unplug event on 0:1
> > > 
> > > We had an issue with MST monitors but that resulted unplug of the DP OUT
> > > not link going down. That was fixed with:
> > > 
> > >    9cb15478916e ("drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read")
> > > 
> > > If you have Intel hardware still there it would be good if you could try
> > > and provide trace from that as well.
> I tried the latest mainline kernel, d358e5254674, which should include the commit you
> mentioned, but no luck.
> 
> I put all the logs here for better reference
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/
> 
> Here is how I get the log
> ```
> $ cat debug
> #!/bin/sh
> 
> . ~/.cargo/env
> sudo ~/.cargo/bin/tbtrace enable
> sleep 10 # plug in the monitor
> sudo ~/.cargo/bin/tbtrace disable
> sudo ~/.cargo/bin/tbtrace dump -vv > trace.out
> sudo dmesg > dmesg.out
> ./tbtools/scripts/merge-logs.py dmesg.out trace.out > merged.out
> ```
> 
> And here is the log
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.out

Thanks!

It shows that right before the unplug the driver is still enumerating
retimers:

[   39.812733] tb_tx Read Request Domain 0 Route 3 Adapter 1 / Lane
               0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
               0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
               0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
                 [00:12]       0x91 Address
                 [13:18]        0x1 Read Size
                 [19:24]        0x1 Adapter Num
                 [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
                 [27:28]        0x0 Sequence Number (SN)
[   39.813005] tb_rx Read Response Domain 0 Route 3 Adapter 1 / Lane
               0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
               0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
               0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
                 [00:12]       0x91 Address
                 [13:18]        0x1 Read Size
                 [19:24]        0x1 Adapter Num
                 [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
                 [27:28]        0x0 Sequence Number (SN)
               0x03/0091 0x81620408 0b10000001 01100010 00000100 00001000 .b.. PORT_CS_1
                 [00:07]        0x8 Address
                 [08:15]        0x4 Length
                 [16:18]        0x2 Target
                 [20:23]        0x6 Re-timer Index
                 [24:24]        0x1 WnR
                 [25:25]        0x0 No Response (NR)
                 [26:26]        0x0 Result Code (RC)
                 [31:31]        0x1 Pending (PND)
[   39.814180] tb_tx Read Request Domain 0 Route 3 Adapter 1 / Lane
               0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
               0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
               0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
                 [00:12]       0x91 Address
                 [13:18]        0x1 Read Size
                 [19:24]        0x1 Adapter Num
                 [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
                 [27:28]        0x0 Sequence Number (SN)
[   39.815193] tb_event Hot Plug Event Packet Domain 0 Route 0 Adapter 3 / Lane
               0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
               0x01/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String Low
               0x02/---- 0x80000003 0b10000000 00000000 00000000 00000011 ....
                 [00:05]        0x3 Adapter Num
                 [31:31]        0x1 UPG
[   39.815196] [2821] thunderbolt 0000:00:0d.2: acking hot unplug event on 0:3
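
The field breakdown in the trace can be reproduced by hand, which helps
when reading a raw dword without tbtrace at hand. Below is a minimal
sketch in POSIX sh. It assumes the bit ranges are exactly the ones
printed in the trace above; the helper names bits and
decode_read_request are hypothetical and not part of tbtools.

```sh
#!/bin/sh
# Hypothetical helper: extract bits [LO:HI] (inclusive) from a word.
# Usage: bits WORD LO HI
bits() {
    width=$(( $3 - $2 + 1 ))
    echo $(( ($1 >> $2) & ((1 << width) - 1) ))
}

# Hypothetical helper: decode dword 0x02 of a read request, using the
# bit ranges shown in the trace (address, read size, adapter number,
# configuration space, sequence number).
decode_read_request() {
    w=$1
    printf 'address=0x%x size=%d adapter=%d space=%d seq=%d\n' \
        "$(bits "$w" 0 12)" "$(bits "$w" 13 18)" "$(bits "$w" 19 24)" \
        "$(bits "$w" 25 26)" "$(bits "$w" 27 28)"
}

# The request header from the trace at 39.812733.
decode_read_request 0x02082091
```

The same bits helper applied to the PORT_CS_1 value, for example
bits 0x81620408 20 23, gives the re-timer index field.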

By default the driver does not access retimers beyond the Type-C
connector. I wonder if you have CONFIG_USB4_DEBUGFS_MARGINING set in your
kernel .config? If so, can you disable it and try again?
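
To check whether CONFIG_USB4_DEBUGFS_MARGINING is set, something like the
following could be used. This is only a sketch: it assumes the running
kernel's config is installed as /boot/config-$(uname -r), which is common
on Ubuntu but not guaranteed, and kconfig_state is a hypothetical helper
name.

```sh
#!/bin/sh
# Hypothetical helper: print "enabled" if SYMBOL is set to y or m in
# CONFIG_FILE, otherwise "disabled".
# Usage: kconfig_state SYMBOL CONFIG_FILE
kconfig_state() {
    if grep -q "^$1=[ym]" "$2"; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Assumed location of the running kernel's config file.
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    kconfig_state CONFIG_USB4_DEBUGFS_MARGINING "$cfg"
else
    echo "no readable config at $cfg" >&2
fi
```

For a self-built kernel, the same helper can be pointed at the .config in
the build tree instead.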

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-12 12:39           ` Mika Westerberg
@ 2025-12-12 14:40             ` Mario Limonciello
  2025-12-17  3:06               ` AceLan Kao
  0 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-12-12 14:40 UTC (permalink / raw)
  To: Mika Westerberg, Chia-Lin Kao (AceLan), Andreas Noever,
	Mika Westerberg, Yehezkel Bernat, linux-usb, linux-kernel,
	Sanath.S, Lin, Wayne

On 12/12/25 6:39 AM, Mika Westerberg wrote:
> Hi,
> 
> On Fri, Dec 12, 2025 at 12:10:24PM +0800, Chia-Lin Kao (AceLan) wrote:
>> Hi Mika,
>>
>> On Wed, Dec 10, 2025 at 03:42:21PM -0600, Mario Limonciello wrote:
>>> +Wayne
>>>
>>> Here is the full thread since you're being added in late.
>>>
>>> https://lore.kernel.org/linux-usb/20251209054141.1975982-1-acelan.kao@canonical.com/
>>>
>>> On 12/10/25 1:41 AM, Mika Westerberg wrote:
>>>> Hi,
>>>>
>>>> On Wed, Dec 10, 2025 at 11:15:25AM +0800, Chia-Lin Kao (AceLan) wrote:
>>>>>> We should understand the issue better. This is Intel Goshen Ridge based
>>>>>> monitor which I'm pretty sure does not require additional quirks, at least
>>>>>> I have not heard any issues like this. I suspect this is combination of the
>>>>>> AMD and Intel hardware that is causing the issue.
>>>>> Actually, we encountered the same issue on Intel machine, too.
>>>>> Here is the log captured by my ex-colleague, and at that time he used
>>>>> 6.16-rc4 drmtip kernel and should have reported this issue somewhere.
>>>>> https://paste.ubuntu.com/p/bJkBTdYMp6/
>>>>>
>>>>> The log combines with drm debug log, and becomes too large to be
>>>>> pasted on the pastebin, so I removed some unrelated lines between 44s
>>>>> ~ 335s.
>>>>
>>>> Okay I see similar unplug there:
>>>>
>>>> [  337.429646] [374] thunderbolt:tb_handle_dp_bandwidth_request:2752: thunderbolt 0000:00:0d.2: 0:5: handling bandwidth allocation request, retry 0
>>>> ...
>>>> [  337.430291] [165] thunderbolt:tb_cfg_ack_plug:842: thunderbolt 0000:00:0d.2: acking hot unplug event on 0:1
>>>>
>>>> We had an issue with MST monitors but that resulted unplug of the DP OUT
>>>> not link going down. That was fixed with:
>>>>
>>>>     9cb15478916e ("drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read")
>>>>
>>>> If you have Intel hardware still there it would be good if you could try
>>>> and provide trace from that as well.
>> I tried the latest mainline kernel, d358e5254674, which should include the commit you
>> mentioned, but no luck.
>>
>> I put all the logs here for better reference
>> https://people.canonical.com/~acelan/bugs/tbt_call_trace/
>>
>> Here is how I get the log
>> ```
>> $ cat debug
>> #!/bin/sh
>>
>> . ~/.cargo/env
>> sudo ~/.cargo/bin/tbtrace enable
>> sleep 10 # plug in the monitor
>> sudo ~/.cargo/bin/tbtrace disable
>> sudo ~/.cargo/bin/tbtrace dump -vv > trace.out
>> sudo dmesg > dmesg.out
>> ./tbtools/scripts/merge-logs.py dmesg.out trace.out > merged.out
>> ```
>>
>> And here is the log
>> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.out
> 
> Thanks!
> 
> It shows that right before the unplug the driver is still enumerating
> retimers:
> 
> [   39.812733] tb_tx Read Request Domain 0 Route 3 Adapter 1 / Lane
>                 0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
>                 0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
>                 0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
>                   [00:12]       0x91 Address
>                   [13:18]        0x1 Read Size
>                   [19:24]        0x1 Adapter Num
>                   [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
>                   [27:28]        0x0 Sequence Number (SN)
> [   39.813005] tb_rx Read Response Domain 0 Route 3 Adapter 1 / Lane
>                 0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
>                 0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
>                 0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
>                   [00:12]       0x91 Address
>                   [13:18]        0x1 Read Size
>                   [19:24]        0x1 Adapter Num
>                   [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
>                   [27:28]        0x0 Sequence Number (SN)
>                 0x03/0091 0x81620408 0b10000001 01100010 00000100 00001000 .b.. PORT_CS_1
>                   [00:07]        0x8 Address
>                   [08:15]        0x4 Length
>                   [16:18]        0x2 Target
>                   [20:23]        0x6 Re-timer Index
>                   [24:24]        0x1 WnR
>                   [25:25]        0x0 No Response (NR)
>                   [26:26]        0x0 Result Code (RC)
>                   [31:31]        0x1 Pending (PND)
> [   39.814180] tb_tx Read Request Domain 0 Route 3 Adapter 1 / Lane
>                 0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
>                 0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
>                 0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
>                   [00:12]       0x91 Address
>                   [13:18]        0x1 Read Size
>                   [19:24]        0x1 Adapter Num
>                   [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
>                   [27:28]        0x0 Sequence Number (SN)
> [   39.815193] tb_event Hot Plug Event Packet Domain 0 Route 0 Adapter 3 / Lane
>                 0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
>                 0x01/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String Low
>                 0x02/---- 0x80000003 0b10000000 00000000 00000000 00000011 ....
>                   [00:05]        0x3 Adapter Num
>                   [31:31]        0x1 UPG
> [   39.815196] [2821] thunderbolt 0000:00:0d.2: acking hot unplug event on 0:3
> 
> By default it does not access retimers beyond the Type-C connector. I
> wonder if you have CONFIG_USB4_DEBUGFS_MARGINING set in your kernel
> .config? And if yes can you disable that and try again.

If this does end up being the reason - maybe we should make this really 
noisy in logs and/or taint the kernel.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-12 14:40             ` Mario Limonciello
@ 2025-12-17  3:06               ` AceLan Kao
  2025-12-17 12:55                 ` Mika Westerberg
  0 siblings, 1 reply; 21+ messages in thread
From: AceLan Kao @ 2025-12-17  3:06 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Mika Westerberg, Andreas Noever, Mika Westerberg, Yehezkel Bernat,
	linux-usb, linux-kernel, Sanath.S, Lin, Wayne

On Fri, Dec 12, 2025 at 10:40 PM Mario Limonciello <mario.limonciello@amd.com> wrote:
>
> On 12/12/25 6:39 AM, Mika Westerberg wrote:
> > Hi,
> >
> > On Fri, Dec 12, 2025 at 12:10:24PM +0800, Chia-Lin Kao (AceLan) wrote:
> >> Hi Mika,
> >>
> >> On Wed, Dec 10, 2025 at 03:42:21PM -0600, Mario Limonciello wrote:
> >>> +Wayne
> >>>
> >>> Here is the full thread since you're being added in late.
> >>>
> >>> https://lore.kernel.org/linux-usb/20251209054141.1975982-1-acelan.kao@canonical.com/
> >>>
> >>> On 12/10/25 1:41 AM, Mika Westerberg wrote:
> >>>> Hi,
> >>>>
> >>>> On Wed, Dec 10, 2025 at 11:15:25AM +0800, Chia-Lin Kao (AceLan) wrote:
> >>>>>> We should understand the issue better. This is Intel Goshen Ridge based
> >>>>>> monitor which I'm pretty sure does not require additional quirks, at least
> >>>>>> I have not heard any issues like this. I suspect this is combination of the
> >>>>>> AMD and Intel hardware that is causing the issue.
> >>>>> Actually, we encountered the same issue on Intel machine, too.
> >>>>> Here is the log captured by my ex-colleague, and at that time he used
> >>>>> 6.16-rc4 drmtip kernel and should have reported this issue somewhere.
> >>>>> https://paste.ubuntu.com/p/bJkBTdYMp6/
> >>>>>
> >>>>> The log combines with drm debug log, and becomes too large to be
> >>>>> pasted on the pastebin, so I removed some unrelated lines between 44s
> >>>>> ~ 335s.
> >>>>
> >>>> Okay I see similar unplug there:
> >>>>
> >>>> [  337.429646] [374] thunderbolt:tb_handle_dp_bandwidth_request:2752: thunderbolt 0000:00:0d.2: 0:5: handling bandwidth allocation request, retry 0
> >>>> ...
> >>>> [  337.430291] [165] thunderbolt:tb_cfg_ack_plug:842: thunderbolt 0000:00:0d.2: acking hot unplug event on 0:1
> >>>>
> >>>> We had an issue with MST monitors but that resulted unplug of the DP OUT
> >>>> not link going down. That was fixed with:
> >>>>
> >>>>     9cb15478916e ("drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read")
> >>>>
> >>>> If you have Intel hardware still there it would be good if you could try
> >>>> and provide trace from that as well.
> >> I tried the latest mainline kernel, d358e5254674, which should include the commit you
> >> mentioned, but no luck.
> >>
> >> I put all the logs here for better reference
> >> https://people.canonical.com/~acelan/bugs/tbt_call_trace/
> >>
> >> Here is how I get the log
> >> ```
> >> $ cat debug
> >> #!/bin/sh
> >>
> >> . ~/.cargo/env
> >> sudo ~/.cargo/bin/tbtrace enable
> >> sleep 10 # plug in the monitor
> >> sudo ~/.cargo/bin/tbtrace disable
> >> sudo ~/.cargo/bin/tbtrace dump -vv > trace.out
> >> sudo dmesg > dmesg.out
> >> ./tbtools/scripts/merge-logs.py dmesg.out trace.out > merged.out
> >> ```
> >>
> >> And here is the log
> >> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.out
> >
> > Thanks!
> >
> > It shows that right before the unplug the driver is still enumerating
> > retimers:
> >
> > [   39.812733] tb_tx Read Request Domain 0 Route 3 Adapter 1 / Lane
> >                 0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
> >                 0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
> >                 0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
> >                   [00:12]       0x91 Address
> >                   [13:18]        0x1 Read Size
> >                   [19:24]        0x1 Adapter Num
> >                   [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
> >                   [27:28]        0x0 Sequence Number (SN)
> > [   39.813005] tb_rx Read Response Domain 0 Route 3 Adapter 1 / Lane
> >                 0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
> >                 0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
> >                 0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
> >                   [00:12]       0x91 Address
> >                   [13:18]        0x1 Read Size
> >                   [19:24]        0x1 Adapter Num
> >                   [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
> >                   [27:28]        0x0 Sequence Number (SN)
> >                 0x03/0091 0x81620408 0b10000001 01100010 00000100 00001000 .b.. PORT_CS_1
> >                   [00:07]        0x8 Address
> >                   [08:15]        0x4 Length
> >                   [16:18]        0x2 Target
> >                   [20:23]        0x6 Re-timer Index
> >                   [24:24]        0x1 WnR
> >                   [25:25]        0x0 No Response (NR)
> >                   [26:26]        0x0 Result Code (RC)
> >                   [31:31]        0x1 Pending (PND)
> > [   39.814180] tb_tx Read Request Domain 0 Route 3 Adapter 1 / Lane
> >                 0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
> >                 0x01/---- 0x00000003 0b00000000 00000000 00000000 00000011 .... Route String Low
> >                 0x02/---- 0x02082091 0b00000010 00001000 00100000 10010001 ....
> >                   [00:12]       0x91 Address
> >                   [13:18]        0x1 Read Size
> >                   [19:24]        0x1 Adapter Num
> >                   [25:26]        0x1 Configuration Space (CS) → Adapter Configuration Space
> >                   [27:28]        0x0 Sequence Number (SN)
> > [   39.815193] tb_event Hot Plug Event Packet Domain 0 Route 0 Adapter 3 / Lane
> >                 0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
> >                 0x01/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String Low
> >                 0x02/---- 0x80000003 0b10000000 00000000 00000000 00000011 ....
> >                   [00:05]        0x3 Adapter Num
> >                   [31:31]        0x1 UPG
> > [   39.815196] [2821] thunderbolt 0000:00:0d.2: acking hot unplug event on 0:3
> >
> > By default it does not access retimers beyond the Type-C connector. I
> > wonder if you have CONFIG_USB4_DEBUGFS_MARGINING set in your kernel
> > .config? And if yes can you disable that and try again.
Sorry, it looks like I had some trouble with my MTA; some emails were
not sent out correctly.

I've rebuilt the kernel without CONFIG_USB4_DEBUGFS_MARGINING, and
here is the log.
A tbt storage device is daisy-chained after the tbt monitor, which
makes the issue easier to reproduce.
https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.2.out

And this one is with only the tbt monitor plugged in.
https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.3.out

>
> If this does end up being the reason - maybe we should make this really
> noisy in logs and/or taint the kernel.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-17  3:06               ` AceLan Kao
@ 2025-12-17 12:55                 ` Mika Westerberg
  2025-12-17 15:53                   ` Mario Limonciello
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-17 12:55 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Mario Limonciello, Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Sanath.S, Lin, Wayne

Hi,

On Wed, Dec 17, 2025 at 11:06:52AM +0800, AceLan Kao wrote:
> > > By default it does not access retimers beyond the Type-C connector. I
> > > wonder if you have CONFIG_USB4_DEBUGFS_MARGINING set in your kernel
> > > .config? And if yes can you disable that and try again.
> Sorry, it looks like I got some troubles with my MTA, some emails are
> not sent out correctly.
> 
> I've rebuilt the kernel without CONFIG_USB4_DEBUGFS_MARGINING, and
> here is the log
> There is a tbt storage daisy-chained after the tbt monitor, it's
> easier to reproduce this issue.
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.2.out
> 
> And this one is only the tbt monitor plugged.
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.3.out

Okay, from the first trace at least, scanning of the retimer at index 2
(which does not exist) does not complete very quickly, and I suspect there
is some timeout on the device side that triggers. We already had a similar
issue with Pluggable devices, but perhaps this is implemented in the Dell
version too?

I wonder if it is enough to set configuration valid and then scan the
downstream retimers? Can you try the attached patch? We do need to scan
them before DP tunnels are created to support ALPM (this is work in
progress).

diff --git a/drivers/thunderbolt/tb.c b/drivers/thunderbolt/tb.c
index d7f32a63fc1e..e23e0ee9c95f 100644
--- a/drivers/thunderbolt/tb.c
+++ b/drivers/thunderbolt/tb.c
@@ -1380,14 +1380,6 @@ static void tb_scan_port(struct tb_port *port)
 	upstream_port = tb_upstream_port(sw);
 	tb_configure_link(port, upstream_port, sw);
 
-	/*
-	 * Scan for downstream retimers. We only scan them after the
-	 * router has been enumerated to avoid issues with certain
-	 * Pluggable devices that expect the host to enumerate them
-	 * within certain timeout.
-	 */
-	tb_retimer_scan(port, true);
-
 	/*
 	 * CL0s and CL1 are enabled and supported together.
 	 * Silently ignore CLx enabling in case CLx is not supported.
@@ -1406,6 +1398,13 @@ static void tb_scan_port(struct tb_port *port)
 	 */
 	tb_switch_configuration_valid(sw);
 
+	/*
+	 * Scan for downstream retimers. We only scan them after the
+	 * router has been enumerated to avoid issues with certain
+	 * Pluggable devices that expect the host to enumerate them
+	 * within certain timeout.
+	 */
+	tb_retimer_scan(port, true);
 	/* Scan upstream retimers */
 	tb_retimer_scan(upstream_port, true);
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-17 12:55                 ` Mika Westerberg
@ 2025-12-17 15:53                   ` Mario Limonciello
  2025-12-18  1:38                     ` AceLan Kao
  0 siblings, 1 reply; 21+ messages in thread
From: Mario Limonciello @ 2025-12-17 15:53 UTC (permalink / raw)
  To: Mika Westerberg, AceLan Kao
  Cc: Andreas Noever, Mika Westerberg, Yehezkel Bernat, linux-usb,
	linux-kernel, Sanath.S, Lin, Wayne

On 12/17/25 6:55 AM, Mika Westerberg wrote:
> Hi,
> 
> On Wed, Dec 17, 2025 at 11:06:52AM +0800, AceLan Kao wrote:
>>>> By default it does not access retimers beyond the Type-C connector. I
>>>> wonder if you have CONFIG_USB4_DEBUGFS_MARGINING set in your kernel
>>>> .config? And if yes can you disable that and try again.
>> Sorry, it looks like I got some troubles with my MTA, some emails are
>> not sent out correctly.
>>
>> I've rebuilt the kernel without CONFIG_USB4_DEBUGFS_MARGINING, and
>> here is the log
>> There is a tbt storage daisy-chained after the tbt monitor, it's
>> easier to reproduce this issue.
>> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.2.out
>>
>> And this one is only the tbt monitor plugged.
>> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.3.out
> 
> Okay from the first trace at least scanning of the retimer at index 2
> (which does not exist) does not complete too fast and I suspect there is
> some timeout on the device side that triggers. We had already similar with
> Pluggable devices but perhaps this is implemented in the Dell version too?
> 
> I wonder it is enough if we set configuration valid and then scan the
> downstream retimers? Can you try the attached patch? We do need to scan
> them before DP tunnels are created to support ALPM (this is work in
> progress).

If it needs to go even later, there is of course the possibility of doing
the upstream ones first and the USB3 tunnels first too.

AceLan, if the patch below doesn't work, you can try pushing it right
before tb_add_dp_resources() to see.

> 
> diff --git a/drivers/thunderbolt/tb.c b/drivers/thunderbolt/tb.c
> index d7f32a63fc1e..e23e0ee9c95f 100644
> --- a/drivers/thunderbolt/tb.c
> +++ b/drivers/thunderbolt/tb.c
> @@ -1380,14 +1380,6 @@ static void tb_scan_port(struct tb_port *port)
>   	upstream_port = tb_upstream_port(sw);
>   	tb_configure_link(port, upstream_port, sw);
>   
> -	/*
> -	 * Scan for downstream retimers. We only scan them after the
> -	 * router has been enumerated to avoid issues with certain
> -	 * Pluggable devices that expect the host to enumerate them
> -	 * within certain timeout.
> -	 */
> -	tb_retimer_scan(port, true);
> -
>   	/*
>   	 * CL0s and CL1 are enabled and supported together.
>   	 * Silently ignore CLx enabling in case CLx is not supported.
> @@ -1406,6 +1398,13 @@ static void tb_scan_port(struct tb_port *port)
>   	 */
>   	tb_switch_configuration_valid(sw);
>   
> +	/*
> +	 * Scan for downstream retimers. We only scan them after the
> +	 * router has been enumerated to avoid issues with certain
> +	 * Pluggable devices that expect the host to enumerate them
> +	 * within certain timeout.
> +	 */
> +	tb_retimer_scan(port, true);

Just a note in case this turns into a proper patch/solution.  Make sure 
you update the comment to cover this monitor too.

>   	/* Scan upstream retimers */
>   	tb_retimer_scan(upstream_port, true);
>   

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-17 15:53                   ` Mario Limonciello
@ 2025-12-18  1:38                     ` AceLan Kao
  2025-12-18  7:21                       ` Mika Westerberg
  0 siblings, 1 reply; 21+ messages in thread
From: AceLan Kao @ 2025-12-18  1:38 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Mika Westerberg, Andreas Noever, Mika Westerberg, Yehezkel Bernat,
	linux-usb, linux-kernel, Sanath.S, Lin, Wayne

On Wed, Dec 17, 2025 at 11:53 PM Mario Limonciello <mario.limonciello@amd.com> wrote:
>
> On 12/17/25 6:55 AM, Mika Westerberg wrote:
> > Hi,
> >
> > On Wed, Dec 17, 2025 at 11:06:52AM +0800, AceLan Kao wrote:
> >>>> By default it does not access retimers beyond the Type-C connector. I
> >>>> wonder if you have CONFIG_USB4_DEBUGFS_MARGINING set in your kernel
> >>>> .config? And if yes can you disable that and try again.
> >> Sorry, it looks like I got some troubles with my MTA, some emails are
> >> not sent out correctly.
> >>
> >> I've rebuilt the kernel without CONFIG_USB4_DEBUGFS_MARGINING, and
> >> here is the log
> >> There is a tbt storage daisy-chained after the tbt monitor, it's
> >> easier to reproduce this issue.
> >> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.2.out
> >>
> >> And this one is only the tbt monitor plugged.
> >> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.3.out
> >
> > Okay from the first trace at least scanning of the retimer at index 2
> > (which does not exist) does not complete too fast and I suspect there is
> > some timeout on the device side that triggers. We had already similar with
> > Pluggable devices but perhaps this is implemented in the Dell version too?
> >
> > I wonder it is enough if we set configuration valid and then scan the
> > downstream retimers? Can you try the attached patch? We do need to scan
> > them before DP tunnels are created to support ALPM (this is work in
> > progress).
>
> If it needs to go even later - there is OFC the possibility of doing
> upstream ones first and USB3 tunnels first too.
>
> I'd say if the below doesn't work Acelan you can try pushing it right
> before tb_add_dp_resources() to see.
Hi Mario,

Still no luck after moving tb_retimer_scan() to right before tb_add_dp_resources():
https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.out

>
> >
> > diff --git a/drivers/thunderbolt/tb.c b/drivers/thunderbolt/tb.c
> > index d7f32a63fc1e..e23e0ee9c95f 100644
> > --- a/drivers/thunderbolt/tb.c
> > +++ b/drivers/thunderbolt/tb.c
> > @@ -1380,14 +1380,6 @@ static void tb_scan_port(struct tb_port *port)
> >       upstream_port = tb_upstream_port(sw);
> >       tb_configure_link(port, upstream_port, sw);
> >
> > -     /*
> > -      * Scan for downstream retimers. We only scan them after the
> > -      * router has been enumerated to avoid issues with certain
> > -      * Pluggable devices that expect the host to enumerate them
> > -      * within certain timeout.
> > -      */
> > -     tb_retimer_scan(port, true);
> > -
> >       /*
> >        * CL0s and CL1 are enabled and supported together.
> >        * Silently ignore CLx enabling in case CLx is not supported.
> > @@ -1406,6 +1398,13 @@ static void tb_scan_port(struct tb_port *port)
> >        */
> >       tb_switch_configuration_valid(sw);
> >
> > +     /*
> > +      * Scan for downstream retimers. We only scan them after the
> > +      * router has been enumerated to avoid issues with certain
> > +      * Pluggable devices that expect the host to enumerate them
> > +      * within certain timeout.
> > +      */
> > +     tb_retimer_scan(port, true);
Hi Mika,

This doesn't work.
https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched1.out

>
> Just a note in case this turns into a proper patch/solution.  Make sure
> you update the comment to cover this monitor too.
>
> >       /* Scan upstream retimers */
> >       tb_retimer_scan(upstream_port, true);
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-18  1:38                     ` AceLan Kao
@ 2025-12-18  7:21                       ` Mika Westerberg
       [not found]                         ` <6inne3luvw4ot3wqnsaw3gzhlxtd4756i465oto6so5ox3syxp@kibuv4vhvexx>
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-18  7:21 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Mario Limonciello, Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Sanath.S, Lin, Wayne

Hi,

On Thu, Dec 18, 2025 at 09:38:13AM +0800, AceLan Kao wrote:
> > > +     /*
> > > +      * Scan for downstream retimers. We only scan them after the
> > > +      * router has been enumerated to avoid issues with certain
> > > +      * Pluggable devices that expect the host to enumerate them
> > > +      * within certain timeout.
> > > +      */
> > > +     tb_retimer_scan(port, true);
> Hi Mika,
> 
> This doesn't work.
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched1.out

Okay thanks for trying. I noticed that there is also USB 2.x disconnect:

[    4.470610] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
[    4.470618] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    4.470620] usb 3-2: Product: USB2.0 Hub
[    4.470622] usb 3-2: Manufacturer: Fresco Logic, Inc.
...
[  104.699872] tb_tx Read Request Domain 0 Route 303 Adapter 0
               0x00/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String High
               0x01/---- 0x00000303 0b00000000 00000000 00000011 00000011 .... Route String Low
               0x02/---- 0x0400202c 0b00000100 00000000 00100000 00101100 ...,
                 [00:12]       0x2c Address
                 [13:18]        0x1 Read Size
                 [19:24]        0x0 Adapter Num
                 [25:26]        0x2 Configuration Space (CS) → Router Configuration Space
                 [27:28]        0x0 Sequence Number (SN)
[  104.700850] tb_event Hot Plug Event Packet Domain 0 Route 0 Adapter 3 / Lane
               0x00/---- 0x80000000 0b10000000 00000000 00000000 00000000 .... Route String High
               0x01/---- 0x00000000 0b00000000 00000000 00000000 00000000 .... Route String Low
               0x02/---- 0x80000003 0b10000000 00000000 00000000 00000011 ....
                 [00:05]        0x3 Adapter Num
                 [31:31]        0x1 UPG
[  104.700852] [763] thunderbolt 0000:00:0d.2: acking hot unplug event on 0:3

// Here we got the unplug to 0:3. After a while

[  106.844134] usb 3-2: USB disconnect, device number 14

Now, since USB 2.x has its own wires in the Type-C cable, this tells me that
there is some real problem with the connection. Have you tried different
cables already?
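
For anyone decoding these dumps by hand: the indented fields under each
header dword follow directly from the bit ranges printed in the trace. A
minimal sketch in plain C, assuming only those printed ranges ([00:12]
address, [13:18] read size, and so on). The helper and struct names are
made up for illustration and are not kernel or tbtools APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): return bits [lo:hi], inclusive. */
static inline uint32_t field(uint32_t dw, unsigned int lo, unsigned int hi)
{
	uint32_t width;

	assert(lo <= hi && hi < 32);
	width = hi - lo + 1;
	return (dw >> lo) & (width == 32 ? 0xffffffffu : (1u << width) - 1);
}

/* Field names and bit ranges copied from the trace output above. */
struct cfg_read_hdr {
	uint32_t address;	/* [00:12] */
	uint32_t read_size;	/* [13:18] */
	uint32_t adapter;	/* [19:24] */
	uint32_t space;		/* [25:26] 1 = adapter, 2 = router, per the trace */
	uint32_t seq;		/* [27:28] */
};

struct hotplug_evt {
	uint32_t adapter;	/* [00:05] */
	uint32_t unplug;	/* [31:31] UPG */
};

/* Decode dword 0x02 of a Read Request / Read Response. */
static inline struct cfg_read_hdr decode_cfg_read(uint32_t dw)
{
	struct cfg_read_hdr h = {
		.address   = field(dw, 0, 12),
		.read_size = field(dw, 13, 18),
		.adapter   = field(dw, 19, 24),
		.space     = field(dw, 25, 26),
		.seq       = field(dw, 27, 28),
	};
	return h;
}

/* Decode dword 0x02 of a Hot Plug Event Packet. */
static inline struct hotplug_evt decode_hotplug(uint32_t dw)
{
	struct hotplug_evt e = {
		.adapter = field(dw, 0, 5),
		.unplug  = field(dw, 31, 31),
	};
	return e;
}
```

Feeding it 0x0400202c from the read request above gives address 0x2c,
read size 1, adapter 0 and space 2. 0x80000003 from the hot plug event
gives adapter 3 with UPG set, which matches the "acking hot unplug event
on 0:3" line.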

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
       [not found]                         ` <6inne3luvw4ot3wqnsaw3gzhlxtd4756i465oto6so5ox3syxp@kibuv4vhvexx>
@ 2025-12-18 10:20                           ` Mika Westerberg
  2025-12-22  1:33                             ` Chia-Lin Kao (AceLan)
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-18 10:20 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan), Mario Limonciello, Andreas Noever,
	Mika Westerberg, Yehezkel Bernat, linux-usb, linux-kernel,
	Sanath.S, Lin, Wayne

On Thu, Dec 18, 2025 at 03:35:05PM +0800, Chia-Lin Kao (AceLan) wrote:
> > Now since USB 2.x has its own wires in Type-C cable this tells me that
> > there is some real problem with the connection. Have you tried different
> > cables already?
> Here is the log I got with another tbt4 cable.
> I'm using the kernel with Mario suggests modification.
> 
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.2_new_cable.out

Here I see (assuming I read it right) that the USB 2.x enumerates only
after the first unplug:

[   28.589861] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
[   28.589864] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[   28.589865] usb 3-2: Product: USB2.0 Hub
[   28.589866] usb 3-2: Manufacturer: Fresco Logic, Inc.

Since Goshen Ridge is pretty stable in Linux, I still suspect a
connection issue rather than SW. Or it could be power related too. AFAIK
USB 2.x should be rock solid but here it seems not. Are you using active or
passive cables, and do they have the lightning logo?

You could still try to comment out both tb_retimer_scan() calls and see if
that makes any difference, but I doubt it, since in your last log the unplug
happened when we were reading the DROM of the second device router, not
when sideband access was done.
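
Side note on the route strings in these traces (Route 0, Route 3, Route
303): as far as I understand the driver, a route string carries one
downstream adapter number per hop, one byte per hop, lowest byte first. A
minimal sketch under that assumption. ROUTE_SHIFT and the function names
are illustrative, and any flag bits in the high dword are assumed to be
masked off already:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one byte of the route string per hop, lowest byte first. */
#define ROUTE_SHIFT 8

/* Number of hops from the host router to the router the route names. */
static inline int route_depth(uint64_t route)
{
	int depth = 0;

	while (route) {
		depth++;
		route >>= ROUTE_SHIFT;
	}
	return depth;
}

/*
 * Downstream adapter taken at the given level on the way to the target
 * router (level 0 = host router, level 1 = first device router, ...).
 */
static inline unsigned int route_hop(uint64_t route, int level)
{
	assert(level >= 0 && level < 8);
	return (unsigned int)((route >> (level * ROUTE_SHIFT)) & 0xff);
}
```

With this, 0x303 decodes to depth 2, reached via adapter 3 of the host
router and then adapter 3 of the first device router. 0x3 is the first
device router and 0 is the host router itself.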

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-18 10:20                           ` Mika Westerberg
@ 2025-12-22  1:33                             ` Chia-Lin Kao (AceLan)
  2025-12-30  7:30                               ` Mika Westerberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chia-Lin Kao (AceLan) @ 2025-12-22  1:33 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Mario Limonciello, Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Sanath.S, Lin, Wayne

On Thu, Dec 18, 2025 at 11:20:21AM +0100, Mika Westerberg wrote:
> On Thu, Dec 18, 2025 at 03:35:05PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > Now since USB 2.x has its own wires in Type-C cable this tells me that
> > > there is some real problem with the connection. Have you tried different
> > > cables already?
> > Here is the log I got with another tbt4 cable.
> > I'm using the kernel with Mario suggests modification.
> > 
> > https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.2_new_cable.out
> 
> Here I see (assuming I read it right) that the USB 2.x enumerates only
> after the first unplug:
> 
> [   28.589861] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
> [   28.589864] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> [   28.589865] usb 3-2: Product: USB2.0 Hub
> [   28.589866] usb 3-2: Manufacturer: Fresco Logic, Inc.
From the logs, sometimes this hub is enumerated before the call trace
and then enumerated again after the call trace.

And I also found there are some suspicious USB disconnections while
plugging in the tbt monitor.

I tried to avoid the USB disconnection by the following modification,
but still no luck.

```
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index be50d03034a9..ed3756065568 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -5697,6 +5697,22 @@ static void hub_port_connect_change(struct usb_hub *hub, int port1,
                        /* Don't resuscitate */;
                }
        }
+#ifdef CONFIG_PM
+       /* Handle device with temporarily lost connection */
+       else if (!(portstatus & USB_PORT_STAT_CONNECTION) && udev &&
+                       udev->state != USB_STATE_NOTATTACHED &&
+                       udev->persist_enabled) {
+               /*
+                * If a device with persist enabled temporarily loses connection
+                * during parent hub reconfiguration (e.g., Thunderbolt re-probe),
+                * don't immediately disconnect it. Clear the change bit and
+                * let the hub resume process handle it properly.
+                */
+               dev_dbg(&port_dev->dev, "device (state=%d) lost connection temporarily, not disconnecting\n",
+                               udev->state);
+               status = 0;
+       }
+#endif
        clear_bit(port1, hub->change_bits);

        /* successfully revalidated the connection */
```

Here is the log with the modification.
https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.usb_temporarily_lost_connection.out

> 
> Since Goshen Ridge is pretty stable in Linux I'm kind of suspecting still a
> connection issue rather than SW. Or could be power related too. AFAIK the
> USB 2.x should be rock solid but here it seems not. Are you using active or
> passive cables and do they have the lightning logo?
I can't tell whether the cable is active or passive. There is a lightning
logo on both ends of the cable, and also a number "4" on both ends.

> 
> You could still try to comment out both tb_retimer_scan() calls and see if
> that makes any difference but I doubt since your last log unplug happened
> when we were reading DROM of the second device router not when sideband
> access was done.
I still think it's waiting for something to be ready, but I don't know what
it's waiting for. Here is the log after applying the 2-second sleep.

https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.wa_2_seconds.out

On the AMD system, the issue can be reproduced 100% of the time, and at
least 2 seconds of delay are required to avoid the call trace.
I guess on the Intel system the value could be lower, because the
reproduction rate there is around 10% ~ 20%.


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-22  1:33                             ` Chia-Lin Kao (AceLan)
@ 2025-12-30  7:30                               ` Mika Westerberg
  2025-12-31  1:33                                 ` Chia-Lin Kao (AceLan)
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-30  7:30 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan), Mario Limonciello, Andreas Noever,
	Mika Westerberg, Yehezkel Bernat, linux-usb, linux-kernel,
	Sanath.S, Lin, Wayne

On Mon, Dec 22, 2025 at 09:33:48AM +0800, Chia-Lin Kao (AceLan) wrote:
> On Thu, Dec 18, 2025 at 11:20:21AM +0100, Mika Westerberg wrote:
> > On Thu, Dec 18, 2025 at 03:35:05PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > Now since USB 2.x has its own wires in Type-C cable this tells me that
> > > > there is some real problem with the connection. Have you tried different
> > > > cables already?
> > > Here is the log I got with another tbt4 cable.
> > > I'm using the kernel with Mario suggests modification.
> > > 
> > > https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.2_new_cable.out
> > 
> > Here I see (assuming I read it right) that the USB 2.x enumerates only
> > after the first unplug:
> > 
> > [   28.589861] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
> > [   28.589864] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> > [   28.589865] usb 3-2: Product: USB2.0 Hub
> > [   28.589866] usb 3-2: Manufacturer: Fresco Logic, Inc.
> > From the logs, sometimes this hub is enumerated before the call trace
> and then enumerated again after the call trace.
> 
> And I also found there are some suspicious USB disconnections while
> plugging in the tbt monitor.
> 
> I tried to avoid the USB disconnection by the following modification,
> but still no luck.

Okay, but I think this is not a SW issue, rather an issue with that
particular monitor/cable/connection/PD. It is not just the USB4 link that
goes down, it's the whole Type-C connection, so something is wrong on
the electrical side of things (well, at least it seems so).

Dell also typically validates that their stuff works in Linux, so I would
expect to get some report from them if that's not the case (unless you are
doing just that ;-))

Have you tried this same monitor with Windows? Do you see the same issue
there? I would expect so.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-30  7:30                               ` Mika Westerberg
@ 2025-12-31  1:33                                 ` Chia-Lin Kao (AceLan)
  2025-12-31  6:03                                   ` Mika Westerberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chia-Lin Kao (AceLan) @ 2025-12-31  1:33 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Mario Limonciello, Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Sanath.S, Lin, Wayne

On Tue, Dec 30, 2025 at 08:30:11AM +0100, Mika Westerberg wrote:
> On Mon, Dec 22, 2025 at 09:33:48AM +0800, Chia-Lin Kao (AceLan) wrote:
> > On Thu, Dec 18, 2025 at 11:20:21AM +0100, Mika Westerberg wrote:
> > > On Thu, Dec 18, 2025 at 03:35:05PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > > Now since USB 2.x has its own wires in Type-C cable this tells me that
> > > > > there is some real problem with the connection. Have you tried different
> > > > > cables already?
> > > > Here is the log I got with another tbt4 cable.
> > > > I'm using the kernel with Mario suggests modification.
> > > >
> > > > https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.2_new_cable.out
> > >
> > > Here I see (assuming I read it right) that the USB 2.x enumerates only
> > > after the first unplug:
> > >
> > > [   28.589861] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
> > > [   28.589864] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> > > [   28.589865] usb 3-2: Product: USB2.0 Hub
> > > [   28.589866] usb 3-2: Manufacturer: Fresco Logic, Inc.
> > > From the logs, sometimes this hub is enumerated before the call trace
> > and then enumerated again after the call trace.
> >
> > And I also found there are some suspicious USB disconnections while
> > plugging in the tbt monitor.
> >
> > I tried to avoid the USB disconnection by the following modification,
> > but still no luck.
>
> Okay but I think this is not a SW issue, rather an issue with that
> particular monitor/cable/connection/PD. It is not just the USB4 link that
> goes down it's the whole type-C connection therefore something is wrong on
> the electrical side of things (well at least it seems so).
If that's the case, would you agree to suppress the scary call trace
like this?

diff --git a/drivers/thunderbolt/path.c b/drivers/thunderbolt/path.c
index f9b11dadfbdd..ae7127eca542 100644
--- a/drivers/thunderbolt/path.c
+++ b/drivers/thunderbolt/path.c
@@ -586,7 +586,18 @@ int tb_path_activate(struct tb_path *path)
        tb_dbg(path->tb, "%s path activation complete\n", path->name);
        return 0;
 err:
-       tb_WARN(path->tb, "%s path activation failed\n", path->name);
+       /*
+        * -ENOTCONN can occur during transient hardware states like lane
+        * bonding or when the Type-C connection has electrical issues. The
+        * hardware may automatically retry by reconnecting. Use a regular
+        * warning instead of tb_WARN to avoid generating call traces for
+        * these expected transient conditions.
+        */
+       if (res == -ENOTCONN)
+               tb_warn(path->tb, "%s path activation failed (port not connected)\n",
+                       path->name);
+       else
+               tb_WARN(path->tb, "%s path activation failed\n", path->name);
        return res;
 }

>
> Dell also typically validate that their stuff works in Linux so I would
> expect to got some report from them if that's not the case (unless you are
> doing just that ;-))
Currently, the issue can be reproduced on the AMD platform every
time the tbt monitor is plugged in. We haven't reported the issue on
the Intel platform yet because of its low failure rate.
The issue is also not critical, as it recovers once the monitor
re-enumerates.
So maybe they won't bother you about this issue.

>
> Have you tried this same monitor with Windows? Do you see the same issue
> there? I would expect so.


* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-31  1:33                                 ` Chia-Lin Kao (AceLan)
@ 2025-12-31  6:03                                   ` Mika Westerberg
  2026-01-02  2:03                                     ` Chia-Lin Kao (AceLan)
  0 siblings, 1 reply; 21+ messages in thread
From: Mika Westerberg @ 2025-12-31  6:03 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan), Mario Limonciello, Andreas Noever,
	Mika Westerberg, Yehezkel Bernat, linux-usb, linux-kernel,
	Sanath.S, Lin, Wayne

On Wed, Dec 31, 2025 at 09:33:15AM +0800, Chia-Lin Kao (AceLan) wrote:
> On Tue, Dec 30, 2025 at 08:30:11AM +0100, Mika Westerberg wrote:
> > On Mon, Dec 22, 2025 at 09:33:48AM +0800, Chia-Lin Kao (AceLan) wrote:
> > > On Thu, Dec 18, 2025 at 11:20:21AM +0100, Mika Westerberg wrote:
> > > > On Thu, Dec 18, 2025 at 03:35:05PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > > > Now since USB 2.x has its own wires in Type-C cable this tells me that
> > > > > > there is some real problem with the connection. Have you tried different
> > > > > > cables already?
> > > > > Here is the log I got with another tbt4 cable.
> > > > > I'm using the kernel with Mario suggests modification.
> > > > >
> > > > > https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.2_new_cable.out
> > > >
> > > > Here I see (assuming I read it right) that the USB 2.x enumerates only
> > > > after the first unplug:
> > > >
> > > > [   28.589861] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
> > > > [   28.589864] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> > > > [   28.589865] usb 3-2: Product: USB2.0 Hub
> > > > [   28.589866] usb 3-2: Manufacturer: Fresco Logic, Inc.
> > > > From the logs, sometimes this hub is enumerated before the call trace
> > > and then enumerated again after the call trace.
> > >
> > > And I also found there are some suspicious USB disconnections while
> > > plugging in the tbt monitor.
> > >
> > > I tried to avoid the USB disconnection by the following modification,
> > > but still no luck.
> >
> > Okay but I think this is not a SW issue, rather an issue with that
> > particular monitor/cable/connection/PD. It is not just the USB4 link that
> > goes down it's the whole type-C connection therefore something is wrong on
> > the electrical side of things (well at least it seems so).
> If that's the case, would you agree to suppress the scary call trace
> like this?
> 
> diff --git a/drivers/thunderbolt/path.c b/drivers/thunderbolt/path.c
> index f9b11dadfbdd..ae7127eca542 100644
> --- a/drivers/thunderbolt/path.c
> +++ b/drivers/thunderbolt/path.c
> @@ -586,7 +586,18 @@ int tb_path_activate(struct tb_path *path)
>         tb_dbg(path->tb, "%s path activation complete\n", path->name);
>         return 0;
>  err:
> -       tb_WARN(path->tb, "%s path activation failed\n", path->name);
> +       /*
> +        * -ENOTCONN can occur during transient hardware states like lane
> +        * bonding or when the Type-C connection has electrical issues. The
> +        * hardware may automatically retry by reconnecting. Use a regular
> +        * warning instead of tb_WARN to avoid generating call traces for
> +        * these expected transient conditions.
> +        */
> +       if (res == -ENOTCONN)
> +               tb_warn(path->tb, "%s path activation failed (port not connected)\n",
> +                       path->name);
> +       else
> +               tb_WARN(path->tb, "%s path activation failed\n", path->name);
>         return res;
>  }

Yes please, but make it call tb_warn() unconditionally instead of
tb_WARN().
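
For illustration, the error path after this change could look roughly like
the userspace sketch below. The struct definitions, the tb_warn() macro
body, the warn_count counter and the path_activate_err() helper are
stand-ins invented for this sketch, not the kernel's actual definitions;
only the log message and the returned error code come from the patch above.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct tb {
	const char *name;
};

struct tb_path {
	struct tb *tb;
	const char *name;
};

/* Counts emitted warnings so the behaviour can be checked in a test. */
static int warn_count;

/*
 * Stand-in for the kernel's tb_warn(): prints a single log line and,
 * unlike a WARN-style macro, produces no call trace.
 */
#define tb_warn(tb, fmt, ...)						\
	do {								\
		warn_count++;						\
		fprintf(stderr, "%s: " fmt, (tb)->name, __VA_ARGS__);	\
	} while (0)

/*
 * Sketch of the "err:" tail of tb_path_activate() with the warning made
 * unconditional: every failure, -ENOTCONN or otherwise, logs one line
 * and the error code is returned unchanged to the caller.
 */
static int path_activate_err(struct tb_path *path, int res)
{
	tb_warn(path->tb, "%s path activation failed\n", path->name);
	return res;
}
```

With this shape there is no special case for -ENOTCONN; the caller still
sees the original error code and decides how to react.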

> > Dell also typically validate that their stuff works in Linux so I would
> > expect to got some report from them if that's not the case (unless you are
> > doing just that ;-))
> Currently, the issue could be reproduced on the AMD platform every
> time when plugging in the tbt monitor. We don't report the issue on
> Intel platform yet, because of it's low failrate.
> And the issue is not critical, as it can be recovered after
> re-enumerating the monitor.
> So maybe they won't bother you about this issue.

Do you only have one of those monitors? It would be good to check whether
another one has the same issue. I have a GR reference device here, which is
what this monitor is based on, but I don't see any unplugs or link issues.
I will ask around whether we have this monitor somewhere.


* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2025-12-31  6:03                                   ` Mika Westerberg
@ 2026-01-02  2:03                                     ` Chia-Lin Kao (AceLan)
  2026-01-05 11:19                                       ` Mika Westerberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chia-Lin Kao (AceLan) @ 2026-01-02  2:03 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Mario Limonciello, Andreas Noever, Mika Westerberg,
	Yehezkel Bernat, linux-usb, linux-kernel, Sanath.S, Lin, Wayne

On Wed, Dec 31, 2025 at 07:03:33AM +0100, Mika Westerberg wrote:
> On Wed, Dec 31, 2025 at 09:33:15AM +0800, Chia-Lin Kao (AceLan) wrote:
> > On Tue, Dec 30, 2025 at 08:30:11AM +0100, Mika Westerberg wrote:
> > > On Mon, Dec 22, 2025 at 09:33:48AM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > On Thu, Dec 18, 2025 at 11:20:21AM +0100, Mika Westerberg wrote:
> > > > > On Thu, Dec 18, 2025 at 03:35:05PM +0800, Chia-Lin Kao (AceLan) wrote:
> > > > > > > Now since USB 2.x has its own wires in Type-C cable this tells me that
> > > > > > > there is some real problem with the connection. Have you tried different
> > > > > > > cables already?
> > > > > > Here is the log I got with another tbt4 cable.
> > > > > > I'm using the kernel with Mario suggests modification.
> > > > > >
> > > > > > https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.patched2.2_new_cable.out
> > > > >
> > > > > Here I see (assuming I read it right) that the USB 2.x enumerates only
> > > > > after the first unplug:
> > > > >
> > > > > [   28.589861] usb 3-2: New USB device found, idVendor=1d5c, idProduct=5801, bcdDevice= 1.01
> > > > > [   28.589864] usb 3-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> > > > > [   28.589865] usb 3-2: Product: USB2.0 Hub
> > > > > [   28.589866] usb 3-2: Manufacturer: Fresco Logic, Inc.
> > > > > From the logs, sometimes this hub is enumerated before the call trace
> > > > and then enumerated again after the call trace.
> > > >
> > > > And I also found there are some suspicious USB disconnections while
> > > > plugging in the tbt monitor.
> > > >
> > > > I tried to avoid the USB disconnection by the following modification,
> > > > but still no luck.
> > >
> > > Okay but I think this is not a SW issue, rather an issue with that
> > > particular monitor/cable/connection/PD. It is not just the USB4 link that
> > > goes down it's the whole type-C connection therefore something is wrong on
> > > the electrical side of things (well at least it seems so).
> > If that's the case, would you agree to suppress the scary call trace
> > like this?
> >
> > diff --git a/drivers/thunderbolt/path.c b/drivers/thunderbolt/path.c
> > index f9b11dadfbdd..ae7127eca542 100644
> > --- a/drivers/thunderbolt/path.c
> > +++ b/drivers/thunderbolt/path.c
> > @@ -586,7 +586,18 @@ int tb_path_activate(struct tb_path *path)
> >         tb_dbg(path->tb, "%s path activation complete\n", path->name);
> >         return 0;
> >  err:
> > -       tb_WARN(path->tb, "%s path activation failed\n", path->name);
> > +       /*
> > +        * -ENOTCONN can occur during transient hardware states like lane
> > +        * bonding or when the Type-C connection has electrical issues. The
> > +        * hardware may automatically retry by reconnecting. Use a regular
> > +        * warning instead of tb_WARN to avoid generating call traces for
> > +        * these expected transient conditions.
> > +        */
> > +       if (res == -ENOTCONN)
> > +               tb_warn(path->tb, "%s path activation failed (port not connected)\n",
> > +                       path->name);
> > +       else
> > +               tb_WARN(path->tb, "%s path activation failed\n", path->name);
> >         return res;
> >  }
>
> Yes please but make it unconditionally do tb_warn() instead of that
> tb_WARN().
Got it.

>
> > > Dell also typically validate that their stuff works in Linux so I would
> > > expect to got some report from them if that's not the case (unless you are
> > > doing just that ;-))
> > Currently, the issue could be reproduced on the AMD platform every
> > time when plugging in the tbt monitor. We don't report the issue on
> > Intel platform yet, because of it's low failrate.
> > And the issue is not critical, as it can be recovered after
> > re-enumerating the monitor.
> > So maybe they won't bother you about this issue.
>
> You only have one of those monitors? It would be good to check with another
> if it has the same issue. I have GR reference device here which is what
> this monitor is based on but I don't see any unplugs or link issues. I will
> ask around if we have somewhere this monitor.
Here is the log from another monitor, a BenQ 5K Thunderbolt 4 one. I can't
reproduce the issue with it, even on the AMD machine.

https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.benq.out


* Re: [PATCH] [RFC] thunderbolt: Add delay for Dell U2725QE link width
  2026-01-02  2:03                                     ` Chia-Lin Kao (AceLan)
@ 2026-01-05 11:19                                       ` Mika Westerberg
  0 siblings, 0 replies; 21+ messages in thread
From: Mika Westerberg @ 2026-01-05 11:19 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan), Mario Limonciello, Andreas Noever,
	Mika Westerberg, Yehezkel Bernat, linux-usb, linux-kernel,
	Sanath.S, Lin, Wayne

On Fri, Jan 02, 2026 at 10:03:12AM +0800, Chia-Lin Kao (AceLan) wrote:
> > You only have one of those monitors? It would be good to check with another
> > if it has the same issue. I have GR reference device here which is what
> > this monitor is based on but I don't see any unplugs or link issues. I will
> > ask around if we have somewhere this monitor.
>
> Here is another BenQ 5k thunderbolt 4 monitor, and I can't reproduce
> the issue with this monitor, even with the AMD machine.
> 
> https://people.canonical.com/~acelan/bugs/tbt_call_trace/intel/merged_6.18.0-d358e5254674+.benq.out

Okay, thanks! We have one Dell GR-based monitor here (not the same as yours)
and a Lenovo GR-based one. I asked the folks to try to repro the issue and
will share the results once I get them.

