* drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
@ 2024-02-27 13:33 Johan Hovold
2024-03-08 12:43 ` Johan Hovold
0 siblings, 1 reply; 8+ messages in thread
From: Johan Hovold @ 2024-02-27 13:33 UTC (permalink / raw)
To: Abhinav Kumar, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
Cc: Daniel Thompson, Sean Paul, Marijn Suijten, David Airlie,
Daniel Vetter, Bjorn Andersson, quic_jesszhan, quic_sbillaka,
dri-devel, freedreno, linux-arm-msm, regressions, linux-kernel
Hi,
Since 6.8-rc1 I have seen (and received reports) of hard resets of the
Lenovo ThinkPad X13s after connecting and disconnecting an external
display.
I have triggered this on a simple disconnect while in a VT console, but
also when stopping Xorg after having repeatedly connected and
disconnected an external display and tried to enable it using xrandr.
In the former case, the last (custom debug) messages printed over an SSH
session were once:
[ 948.416358] usb 5-1: USB disconnect, device number 3
[ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
[ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
[ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done
and then the hypervisor resets the machine.
Hotplug in Xorg seems to work worse than it did with 6.7, which also had
some issues. Connecting a display once seems to work fine, but trying to
re-enable a reconnected display using xrandr sometimes does not work at
all, while with 6.7 it usually worked on the second xrandr execution.
xrandr reports the reconnected display as disconnected:
Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 5120 x 4096
eDP-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 286mm x 178mm
1920x1200 60.03*+
1600x1200 60.00
DP-1 disconnected (normal left inverted right x axis y axis)
DP-2 disconnected 1920x1200+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
1920x1200 (0x40c) 154.000MHz +HSync -VSync
h: width 1920 start 1968 end 2000 total 2080 skew 0 clock 74.04KHz
v: height 1200 start 1203 end 1209 total 1235 clock 59.95Hz
Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
display as connected, but the display is still blank (unlike with 6.7).
A few times after having exercised hotplug this way, the machine hard
resets when Xorg is later stopped. Once I saw the following log messages
on an SSH session but they may not have been printed directly before
the hard reset:
[ 214.555781] [drm:dpu_encoder_phys_vid_wait_for_commit_done:492] [dpu error]vblank timeout
[ 214.555843] [drm:dpu_kms_wait_for_commit_done:483] [dpu error]wait for commit done returned -110
Note that this appears to be unrelated to the recently fixed Qualcomm
power domain driver bug which can trigger similar resets when
initialising the display subsystem on boot. [1] Specifically, I have
triggered the hotplug resets described above also with the fix applied.
Reverting commit e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
to report HPD status changes") which fixes the related VT console
regression does not seem to make any difference. [2]
Daniel Thompson reports that reverting the whole runtime PM series
appears to make the hard resets he has seen with DisplayPort hotplug go
away, however:
https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khsieh@quicinc.com/
So for now, let's assume that these regressions were also introduced (or
triggered) by commit 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime
framework into DP driver").
Johan
[1] https://lore.kernel.org/lkml/20240226-rpmhpd-enable-corner-fix-v1-1-68c004cec48c@quicinc.com/
[2] https://lore.kernel.org/lkml/Zd3YPGmrprxv-N-O@hovoldconsulting.com/
#regzbot introduced: 5814b8bf086a
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-02-27 13:33 drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1 Johan Hovold
@ 2024-03-08 12:43 ` Johan Hovold
  2024-03-08 17:50   ` Abhinav Kumar
  0 siblings, 1 reply; 8+ messages in thread

From: Johan Hovold @ 2024-03-08 12:43 UTC (permalink / raw)
To: Abhinav Kumar, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
Cc: Daniel Thompson, Sean Paul, Marijn Suijten, David Airlie,
    Daniel Vetter, Bjorn Andersson, quic_jesszhan, quic_sbillaka,
    dri-devel, freedreno, linux-arm-msm, regressions, linux-kernel

Hi Abhinav, Rob, Dmitry and Kuogee,

On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote:

> Since 6.8-rc1 I have seen (and received reports) of hard resets of the
> Lenovo ThinkPad X13s after connecting and disconnecting an external
> display.
>
> I have triggered this on a simple disconnect while in a VT console, but
> also when stopping Xorg after having repeatedly connected and
> disconnected an external display and tried to enable it using xrandr.
>
> In the former case, the last (custom debug) messages printed over an SSH
> session were once:
>
> [ 948.416358] usb 5-1: USB disconnect, device number 3
> [ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
> [ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
> [ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
> [ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done
>
> and then the hypervisor resets the machine.

Has there been any progress on tracking down the reset on disconnect
issue? I was expecting you to share your findings here so that we can
determine if the rest of the runtime PM series needs to be reverted or
not before 6.8 is released (possibly on Sunday).
I really did not want to spend more time on this driver than I already
have this cycle, but the lack of (visible) progress has again forced me
to do so.

It's quite likely that the resets are indeed a regression caused by the
runtime PM series as the bus clocks were not disabled on disconnect
before that one was merged in 6.8-rc1.

In a VT console, the device is now runtime suspended immediately on
disconnect, while in X, it currently remains active until X is killed,
which is consistent with what I reported above.

We now also have Bjorn's call trace from a different Qualcomm platform
triggering an unclocked access on disconnect, something which would
trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

[ 2030.379417] SError Interrupt on CPU0, code 0x00000000be000000 -- SError
[ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2030.379501] pc : el1_interrupt+0x28/0xc0
[ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24
[ 2030.379520] sp : ffff80008129b700
[ 2030.379523] x29: ffff80008129b700 x28: ffff7a0a031aa200 x27: ffff7a0af768c3c0
[ 2030.379530] x26: 0000000000000000 x25: ffff7a0a024bb568 x24: ffff7a0a031aa200
[ 2030.379537] x23: 0000000060400005 x22: ffffd2a4a10c871c x21: ffff80008129b8a0
[ 2030.379544] x20: ffffd2a4a00002f0 x19: ffff80008129b750 x18: 0000000000000000
[ 2030.379549] x17: 0000000000000000 x16: ffffd2a4a10c21d4 x15: 0000000000000000
[ 2030.379555] x14: 0000000000000000 x13: 000000000001acf6 x12: 0000000000000000
[ 2030.379560] x11: 0000000000000001 x10: ffff7a0af623f680 x9 : 0000000100000001
[ 2030.379565] x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 000000000000003f
[ 2030.379570] x5 : ffff7a0a003c6a70 x4 : 000000000000001f x3 : ffffd2a48d6722dc
[ 2030.379576] x2 : 0000000000000002 x1 : ffffd2a4a00002f0 x0 : ffff80008129b750
[ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379642] Call trace:
[ 2030.379644]  dump_backtrace+0xec/0x108
[ 2030.379654]  show_stack+0x18/0x24
[ 2030.379659]  dump_stack_lvl+0x40/0x84
[ 2030.379664]  dump_stack+0x18/0x24
[ 2030.379668]  panic+0x130/0x34c
[ 2030.379673]  nmi_panic+0x44/0x90
[ 2030.379679]  arm64_serror_panic+0x68/0x74
[ 2030.379683]  do_serror+0xc4/0xcc
[ 2030.379686]  el1h_64_error_handler+0x34/0x4c
[ 2030.379692]  el1h_64_error+0x64/0x68
[ 2030.379696]  el1_interrupt+0x28/0xc0
[ 2030.379700]  el1h_64_irq_handler+0x18/0x24
[ 2030.379706]  el1h_64_irq+0x64/0x68
[ 2030.379710]  _raw_spin_unlock_irq+0x20/0x48
[ 2030.379718]  wait_for_common+0xb4/0x16c
[ 2030.379722]  wait_for_completion_timeout+0x14/0x20
[ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
[ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
[ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
[ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
[ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
[ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
[ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
[ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
[ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
[ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
[ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
[ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
[ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
[ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
[ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
[ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
[ 2030.381241]  process_scheduled_works+0x19c/0x2cc
[ 2030.381249]  worker_thread+0x290/0x3cc
[ 2030.381255]  kthread+0xfc/0x184
[ 2030.381260]  ret_from_fork+0x10/0x20

The above could happen if the convoluted hotplug state machine breaks
down so that the device is runtime suspended before
dp_bridge_atomic_disable() is called.

For some reason, possibly due to unrelated changes in timing, possibly
after the hotplug revert, I am no longer able to reproduce the reset
with 6.8-rc7 on the X13s.

I am however able to get the hotplug state machine to leak runtime PM
reference counts, which prevents the device from ever being suspended
(e.g. by disconnecting slowly so that we get multiple connect and
disconnect events). This can manifest itself as a hotplug event which is
processed after disconnecting the display:

	[drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

I would not be surprised at all if there's a sequence of events that
leads to an unbalanced put instead (and the stack trace and observations
above make this appear likely).

Were you able to determine what events lead to the premature disabling
of the bus clocks on the RB3?

Have you got any reason to believe that the revert of the hotplug
notification patch may in any way prevent that? Or is it just papering
over the issue?

> Hotplug in Xorg seems to work worse than it did with 6.7, which also had
> some issues. Connecting a display once seems to work fine, but trying to
> re-enable a reconnected display using xrandr sometimes does not work at
> all, while with 6.7 it usually worked on the second xrandr execution.
>
> xrandr reports the reconnected display as disconnected:
>
> Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
> display as connected, but the display is still blank (unlike with 6.7).
As I mentioned elsewhere, the revert of e467e0bde881 ("drm/msm/dp: use
drm_bridge_hpd_notify() to report HPD status changes") in rc7 does seem
to help with the hotplug detect issues that I could reproduce (in VT
console, X and wayland).

The question is whether we should revert the whole runtime PM series so
that the bus clock is left on, which should prevent any resets on
disconnect.

Without any analysis from you or reason to believe the issue to have
been resolved, I'm inclined to just go ahead and revert it. It clearly
had not been tested enough before being merged and I'm quite frustrated
with how this has been handled.

Johan
* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-03-08 12:43 ` Johan Hovold
@ 2024-03-08 17:50   ` Abhinav Kumar
  2024-03-09 16:30     ` Johan Hovold
  0 siblings, 1 reply; 8+ messages in thread

From: Abhinav Kumar @ 2024-03-08 17:50 UTC (permalink / raw)
To: Johan Hovold, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh
Cc: Daniel Thompson, Sean Paul, Marijn Suijten, David Airlie,
    Daniel Vetter, Bjorn Andersson, quic_jesszhan, quic_sbillaka,
    dri-devel, freedreno, linux-arm-msm, regressions, linux-kernel

Hi Johan,

On 3/8/2024 4:43 AM, Johan Hovold wrote:
> Hi Abhinav, Rob, Dmitry and Kuogee,
>
> On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote:
>
>> Since 6.8-rc1 I have seen (and received reports) of hard resets of the
>> Lenovo ThinkPad X13s after connecting and disconnecting an external
>> display.
>>
>> I have triggered this on a simple disconnect while in a VT console, but
>> also when stopping Xorg after having repeatedly connected and
>> disconnected an external display and tried to enable it using xrandr.
>>
>> In the former case, the last (custom debug) messages printed over an SSH
>> session were once:
>>
>> [ 948.416358] usb 5-1: USB disconnect, device number 3
>> [ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
>> [ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
>> [ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
>> [ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done
>>
>> and then the hypervisor resets the machine.
>
> Has there been any progress on tracking down the reset on disconnect
> issue? I was expecting you to share your findings here so that we can
> determine if the rest of the runtime PM series needs to be reverted or
> not before 6.8 is released (possibly on Sunday).
> I really did not want to spend more time on this driver than I already
> have this cycle, but the lack of (visible) progress has again forced me
> to do so.
>
> It's quite likely that the resets are indeed a regression caused by the
> runtime PM series as the bus clocks were not disabled on disconnect
> before that one was merged in 6.8-rc1.

For this last remaining reset with the stacktrace you have mentioned
below, I do not think this was introduced due to the PM runtime series.
We have had this report earlier with the same stacktrace as well in
earlier kernels which didn't have PM runtime support:

https://gitlab.freedesktop.org/drm/msm/-/issues/17

We had fixed the issue reported by the user that time with the below
patch, which was confirmed by the user as fixed:

https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khsieh@quicinc.com/

But it seems like there is another race condition in this code resulting
in this issue again. I have posted my analysis with the patch here as an
RFC yesterday:

https://patchwork.freedesktop.org/patch/581758/

I missed CCing you and Bjorn on the RFC but when I post it as a patch
either today/tomorrow, I will CC you both. Sorry, much of the discussion
of this issue was happening on #freedreno the last few days.

> In a VT console, the device is now runtime suspended immediately on
> disconnect, while in X, it currently remains active until X is killed,
> which is consistent with what I reported above.
>
> We now also have Bjorn's call trace from a different Qualcomm platform
> triggering an unclocked access on disconnect, something which would
> trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>
> [ 2030.379417] SError Interrupt on CPU0, code 0x00000000be000000 -- SError
> [ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> [ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> [ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper]
> [ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 2030.379501] pc : el1_interrupt+0x28/0xc0
> [ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24
> [ 2030.379520] sp : ffff80008129b700
> [ 2030.379523] x29: ffff80008129b700 x28: ffff7a0a031aa200 x27: ffff7a0af768c3c0
> [ 2030.379530] x26: 0000000000000000 x25: ffff7a0a024bb568 x24: ffff7a0a031aa200
> [ 2030.379537] x23: 0000000060400005 x22: ffffd2a4a10c871c x21: ffff80008129b8a0
> [ 2030.379544] x20: ffffd2a4a00002f0 x19: ffff80008129b750 x18: 0000000000000000
> [ 2030.379549] x17: 0000000000000000 x16: ffffd2a4a10c21d4 x15: 0000000000000000
> [ 2030.379555] x14: 0000000000000000 x13: 000000000001acf6 x12: 0000000000000000
> [ 2030.379560] x11: 0000000000000001 x10: ffff7a0af623f680 x9 : 0000000100000001
> [ 2030.379565] x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 000000000000003f
> [ 2030.379570] x5 : ffff7a0a003c6a70 x4 : 000000000000001f x3 : ffffd2a48d6722dc
> [ 2030.379576] x2 : 0000000000000002 x1 : ffffd2a4a00002f0 x0 : ffff80008129b750
> [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> [ 2030.379642] Call trace:
> [ 2030.379644]  dump_backtrace+0xec/0x108
> [ 2030.379654]  show_stack+0x18/0x24
> [ 2030.379659]  dump_stack_lvl+0x40/0x84
> [ 2030.379664]  dump_stack+0x18/0x24
> [ 2030.379668]  panic+0x130/0x34c
> [ 2030.379673]  nmi_panic+0x44/0x90
> [ 2030.379679]  arm64_serror_panic+0x68/0x74
> [ 2030.379683]  do_serror+0xc4/0xcc
> [ 2030.379686]  el1h_64_error_handler+0x34/0x4c
> [ 2030.379692]  el1h_64_error+0x64/0x68
> [ 2030.379696]  el1_interrupt+0x28/0xc0
> [ 2030.379700]  el1h_64_irq_handler+0x18/0x24
> [ 2030.379706]  el1h_64_irq+0x64/0x68
> [ 2030.379710]  _raw_spin_unlock_irq+0x20/0x48
> [ 2030.379718]  wait_for_common+0xb4/0x16c
> [ 2030.379722]  wait_for_completion_timeout+0x14/0x20
> [ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
> [ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
> [ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> [ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
> [ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
> [ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
> [ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> [ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
> [ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> [ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
> [ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
> [ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
> [ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> [ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> [ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
> [ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
> [ 2030.381241]  process_scheduled_works+0x19c/0x2cc
> [ 2030.381249]  worker_thread+0x290/0x3cc
> [ 2030.381255]  kthread+0xfc/0x184
> [ 2030.381260]  ret_from_fork+0x10/0x20
>
> The above could happen if the convoluted hotplug state machine breaks
> down so that the device is runtime suspended before
> dp_bridge_atomic_disable() is called.

Yes, the state machine got broken and I have explained how in the commit
text of the RFC. But it's not necessarily due to PM runtime but rather a
sequence of events from userspace exposing this breakage.

> For some reason, possibly due to unrelated changes in timing, possibly
> after the hotplug revert, I am no longer able to reproduce the reset
> with 6.8-rc7 on the X13s.

I do not know how the hotplug revert fixed this stack because I think
this can still happen.

> I am however able to get the hotplug state machine to leak
> runtime PM reference counts which prevents it from ever being suspended
> (e.g. by disconnecting slowly so that we get multiple connect and
> disconnect events). This can manifest itself as a hotplug event which is
> processed after disconnecting the display:
>
> [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

Hmm ... this is another new report. I was not aware of this till this
instance. We would like to debug this issue too as we have also not seen
this one before. But I am suspicious about how or why a refcount leak
leads to a DPCD read failure.

> I would not be surprised at all if there's a sequence of events that
> leads to an unbalanced put instead (and the stack trace and observations
> above make this appear likely).

The stacktrace points to a different analysis.

> Were you able to determine what events lead to the premature disabling
> of the bus clocks on the RB3?

Yes, please check my RFC. I am waiting for Kuogee to review it too and I
shall post it formally as a patch.

> Have you got any reason to believe that the revert of the hotplug
> notification patch may in any way prevent that? Or is it just papering
> over the issue?

No, I do not think reverting the hotplug notification patch can fix that
reset.
>> Hotplug in Xorg seems to work worse than it did with 6.7, which also had
>> some issues. Connecting a display once seems to work fine, but trying to
>> re-enable a reconnected display using xrandr sometimes does not work at
>> all, while with 6.7 it usually worked on the second xrandr execution.
>>
>> xrandr reports the reconnected display as disconnected:
>
>> Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
>> display as connected, but the display is still blank (unlike with 6.7).
>
> As I mentioned elsewhere, the revert of e467e0bde881 ("drm/msm/dp: use
> drm_bridge_hpd_notify() to report HPD status changes") in rc7 does seem
> to help with the hotplug detect issues that I could reproduce (in VT
> console, X and wayland).

Yes, the revert of e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
to report HPD status changes") affecting hotplug status and the
associated connection issues makes sense. But I think the reset with the
above mentioned stacktrace can still happen.

> The question is whether we should revert the whole runtime PM series so
> that the bus clock is left on, which should prevent any resets on
> disconnect.

At this point, I would say that if the RFC resolves the reset for you
and Bjorn with the above stack, that should be the way to go rather than
a full revert. Unless you are facing issues other than this last reset
crash.

> Without any analysis from you or reason to believe the issue to have
> been resolved, I'm inclined to just go ahead and revert it. It clearly
> had not been tested enough before being merged and I'm quite frustrated
> with how this has been handled.

I must admit that the number of issues reported on sc8280xp was
surprising for us as well, because the series was tested on both
sc7180/sc7280 by us and other users, but not on sc8280xp as we did not
have access to this device at that time.

sc7280 is the only other device we supported upstream which supported
eDP natively before Qualcomm X1E80100 landed this cycle. Surprisingly
enough, perhaps because it uses a Chrome defconfig or Chrome userspace,
no issues were seen at that time, or even now for that matter.

To avoid breakage on sc8280xp moving forward, as it has more visibility,
our team here has decided to set up sc8280xp devices to test out all msm
display related changes, especially eDP/DP as it shares the eDP/DP
driver. We can also CC you on these, in case you want to test them as
well and give your Tested-by.

> Johan
* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-03-08 17:50   ` Abhinav Kumar
@ 2024-03-09 16:30     ` Johan Hovold
  2024-03-11 13:43       ` Johan Hovold
  0 siblings, 1 reply; 8+ messages in thread

From: Johan Hovold @ 2024-03-09 16:30 UTC (permalink / raw)
To: Abhinav Kumar
Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Daniel Thompson,
    Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
    Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
    freedreno, linux-arm-msm, regressions, linux-kernel

On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> On 3/8/2024 4:43 AM, Johan Hovold wrote:

> For this last remaining reset with the stacktrace you have mentioned
> below, I do not think this was introduced due to the PM runtime series.
> We have had this report earlier with the same stacktrace as well in
> earlier kernels which didn't have PM runtime support:

> But it seems like there is another race condition in this code resulting
> in this issue again.
>
> I have posted my analysis with the patch here as an RFC yesterday:
>
> https://patchwork.freedesktop.org/patch/581758/
>
> I missed CCing you and Bjorn on the RFC but when I post it as a patch
> either today/tomorrow, I will CC you both.

Ok, thanks. I'll take a closer look at that next week. It's not clear to
me what that race looks like after reading the commit message. It would
be good to have some more details in there (e.g. the sequence of events
that breaks the state machine) and some way to reproduce the issue
reliably.

> > We now also have Bjorn's call trace from a different Qualcomm platform
> > triggering an unclocked access on disconnect, something which would
> > trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

> > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> > [ 2030.379642] Call trace:
> > [ 2030.379722]  wait_for_completion_timeout+0x14/0x20
> > [ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
> > [ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
> > [ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> > [ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
> > [ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
> > [ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
> > [ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> > [ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
> > [ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> > [ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
> > [ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
> > [ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
> > [ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> > [ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> > [ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
> > [ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
> > [ 2030.381241]  process_scheduled_works+0x19c/0x2cc
> > [ 2030.381249]  worker_thread+0x290/0x3cc
> > [ 2030.381255]  kthread+0xfc/0x184
> > [ 2030.381260]  ret_from_fork+0x10/0x20
> >
> > The above could happen if the convoluted hotplug state machine breaks
> > down so that the device is runtime suspended before
> > dp_bridge_atomic_disable() is called.

> Yes, the state machine got broken and I have explained how in the commit
> text of the RFC. But it's not necessarily due to PM runtime but rather a
> sequence of events happening from userspace exposing this breakage.

After looking at this some more today, I agree with you.
The observations I've reported are consistent with what would happen if
the link clock is disabled when dp_bridge_atomic_disable() is called.
That clock is disabled in dp_bridge_atomic_post_disable() before runtime
suspending, but perhaps there are some further paths that can end up
disabling it.

> > For some reason, possibly due to unrelated changes in timing, possibly
> > after the hotplug revert, I am no longer able to reproduce the reset
> > with 6.8-rc7 on the X13s.

> I do not know how the hotplug revert fixed this stack because I think
> this can still happen.

Thanks for confirming.

> > I am however able to get the hotplug state machine to leak
> > runtime PM reference counts which prevents it from ever being suspended
> > (e.g. by disconnecting slowly so that we get multiple connect and
> > disconnect events). This can manifest itself as a hotplug event which is
> > processed after disconnecting the display:
> >
> > [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

> Hmm ... this is another new report. I was not aware of this till this
> instance. We would like to debug this issue too as we have also not seen
> this one before.
>
> But I am suspicious about how or why a refcount leak leads to a DPCD
> read failure.

It's the other way round: the hotplug event is processed after
disconnect and that leaks a runtime PM usage count:

< usb-c disconnect >

[ 158.167845] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2

Here's the disconnect event.

[ 158.172234] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[ 158.175164] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1

Here's an unexpected connect event which is added to the event queue if
the link state is updated by a racing thread (after this printk).

[ 158.175696] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 1
[ 158.175704] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0

Such as here.

[ 158.180972] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 158.184146] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 158.186565] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable
[ 158.235903] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_post_disable
[ 158.237759] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 2, enable = 0
[ 158.238484] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
[ 158.240350] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 0
[ 158.240966] msm-dp-display ae98000.displayport-controller: dp_pm_runtime_suspend
[ 158.242674] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 0, enable = 0
[ 158.259286] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 158.275869] msm-dp-display ae98000.displayport-controller: dp_hpd_plug_handle

Here the spurious connect event is processed but the external display is
already disconnected.

[ 158.277070] msm-dp-display ae98000.displayport-controller: dp_hpd_plug_handle - rpm get
[ 158.278234] msm-dp-display ae98000.displayport-controller: __pm_runtime_resume - usage count = 1
[ 158.279322] msm-dp-display ae98000.displayport-controller: dp_pm_runtime_resume
[ 158.280386] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 0, enable = 1
[ 158.283766] msm-dp-display ae98000.displayport-controller: __pm_runtime_resume - usage count = 2
[ 158.340829] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 1
...
[ 160.101867] msm-dp-display ae98000.displayport-controller: __pm_runtime_resume - usage count = 2
[ 160.159553] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 1
[ 160.160643] [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

Which leads to the dpcd read failure and leaves the runtime PM usage
count incremented (the next hotplug event bumps the count to 2).

I've also hit one other scenario where a runtime PM reference count is
leaked (without any dpcd errors), but the above one is easily reproduced
by pulling out the plug slowly (the printks may possibly help make this
more reproducible by altering the timing slightly).

> At this point, I would say that if the RFC resolves the reset for you
> and Bjorn with the above stack, that should be the way to go rather
> than a full revert.
>
> Unless you are facing issues other than this last reset crash.

We only have Bjorn's stack trace as the hypervisor just resets the
machine when I hit this on sc8280xp, but I think it's reasonable to
assume that we are hitting the same or at least a related issue.

Also note that I've hit the resets in two different ways: disconnect in
a VT console and when stopping X some time after a disconnect (Bjorn's
stack trace is from the former case, I think). Both resets occur around
the time where dp_bridge_atomic_disable() is normally called, but I'm
still missing an explanation for how the link clock got disabled
prematurely (if that's what happened).

> To avoid breakage on sc8280xp moving forward, as it has more
> visibility, our team here has decided to set up sc8280xp devices to
> test out all msm display related changes, especially eDP/DP as it
> shares the eDP/DP driver. We can also CC you on these, in case you
> want to test them as well and give your Tested-by.

That's good to hear. I'm sure your testing will be sufficient to catch
any obvious regressions.
Once it hits mainline we have a larger user base of people running the X13s which will provide further testing, but it's not very efficient to rely on third parties with these kinds of issues that just trigger resets on this platform. It gives us an indication that something is broken, but someone still needs to spend time trying to narrow it down.

Johan

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-03-09 16:30         ` Johan Hovold
@ 2024-03-11 13:43           ` Johan Hovold
  2024-03-11 15:16             ` Johan Hovold
  2024-03-11 16:51             ` Abhinav Kumar
  0 siblings, 2 replies; 8+ messages in thread
From: Johan Hovold @ 2024-03-11 13:43 UTC (permalink / raw)
To: Abhinav Kumar
Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Daniel Thompson,
	Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> > On 3/8/2024 4:43 AM, Johan Hovold wrote:
>
> > For this last remaining reset with the stacktrace you have mentioned
> > below, I do not think this was introduced by the PM runtime series. We
> > have had this report earlier with the same stacktrace as well on earlier
> > kernels which didn't have PM runtime support.
>
> > But, it seems like there is another race condition in this code
> > resulting in this issue again.
> >
> > I have posted my analysis with the patch here as an RFC yesterday:
> >
> > https://patchwork.freedesktop.org/patch/581758/
> >
> > I missed CCing you and Bjorn on the RFC but when I post it as a patch
> > either today/tomorrow, I will CC you both.
>
> Ok, thanks. I'll take a closer look at that next week. It's not clear to
> me what that race looks like after reading the commit message. It would
> be good to have some more details in there (e.g. the sequence of events
> that breaks the state machine) and some way to reproduce the issue
> reliably.

I was able to reproduce the reset with some of my debug printks in place after reapplying the reverted HPD notify change, so I now have an explanation for (one of) the ways we can end up in this state.
This does not match the description of the problem in the fix linked to above, and I don't think that patch solves the underlying issue even if it may make the race window somewhat smaller. More details below.

> > > We now also have Bjorn's call trace from a different Qualcomm platform
> > > triggering an unclocked access on disconnect, something which would
> > > trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>
> > > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> > > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> > > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> > > [ 2030.379642] Call trace:
>
> > > [ 2030.379722] wait_for_completion_timeout+0x14/0x20
> > > [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
> > > [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
> > > [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> > > [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
> > > [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
> > > [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
> > > [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> > > [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
> > > [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> > > [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
> > > [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
> > > [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
> > > [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> > > [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> > > [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
> > > [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
> > > [ 2030.381241] process_scheduled_works+0x19c/0x2cc
> > > [ 2030.381249] worker_thread+0x290/0x3cc
> > > [ 2030.381255] kthread+0xfc/0x184
> > > [ 2030.381260] ret_from_fork+0x10/0x20
> > >
> > > The above could happen if the convoluted hotplug state machine breaks
> > > down so that the device is runtime suspended before
> > > dp_bridge_atomic_disable() is called.
>
> > Yes, the state machine got broken and I have explained how in the commit
> > text of the RFC. But it's not necessarily due to PM runtime but a
> > sequence of events happening from userspace exposing this breakage.
>
> After looking at this some more today, I agree with you. The
> observations I've reported are consistent with what would happen if the
> link clock is disabled when dp_bridge_atomic_disable() is called.
>
> That clock is disabled in dp_bridge_atomic_post_disable() before runtime
> suspending, but perhaps there are some further paths that can end up
> disabling it.

Turns out there are paths like that, and we hit those more often before reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()").

Specifically, when a previous connect attempt did not enable the bridge fully, so that it is still in the ST_MAINLINK_READY state when we receive a disconnect event, dp_hpd_unplug_handle() will turn off the link clock.
[ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
[ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link

A racing connect event, such as the one I described earlier, can then try to enable the bridge again, but dp_bridge_atomic_enable() just bails out early (and leaks an rpm reference) because we're now in ST_DISCONNECTED:

[ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
[ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
[ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
[ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
[ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
[ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
[ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)

Clearing link_ready already in dp_display_notify_disconnect() would make the race window slightly smaller, but it would essentially just paper over the bug as the events are still not serialised. Notably, there is no user space interaction involved here and it's the spurious connect event that triggers the bridge enable.

When the fbdev hotplug code later disables the never-enabled bridge, things go boom:

[ 204.649072] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 204.650378] [CONNECTOR:35:DP-2] status updated from connected to disconnected
[ 204.651111] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable

as the link clock has already been disabled when accessing the link registers.
The stack trace for the bridge enable above is:

[ 204.553922] dp_bridge_atomic_enable+0x40/0x2f0 [msm]
[ 204.555241] drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
[ 204.556557] drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
[ 204.557853] msm_atomic_commit_tail+0x204/0x804 [msm]
[ 204.559173] commit_tail+0xa4/0x18c [drm_kms_helper]
[ 204.560450] drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
[ 204.561743] drm_atomic_commit+0xa4/0xdc [drm]
[ 204.563065] drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
[ 204.564402] drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
[ 204.565733] drm_client_modeset_commit+0x30/0x58 [drm]
[ 204.567055] __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
[ 204.568381] drm_fb_helper_hotplug_event.part.0+0xd4/0x110 [drm_kms_helper]
[ 204.569708] drm_fb_helper_hotplug_event+0x38/0x44 [drm_kms_helper]
[ 204.571032] msm_fbdev_client_hotplug+0x28/0xd4 [msm]
[ 204.572395] drm_client_dev_hotplug+0xcc/0x130 [drm]
[ 204.573755] drm_kms_helper_connector_hotplug_event+0x34/0x44 [drm_kms_helper]
[ 204.575114] drm_bridge_connector_hpd_cb+0x90/0xa4 [drm_kms_helper]
[ 204.576465] drm_bridge_hpd_notify+0x40/0x5c [drm]
[ 204.577842] drm_aux_hpd_bridge_notify+0x18/0x28 [aux_hpd_bridge]
[ 204.579184] pmic_glink_altmode_worker+0xc0/0x23c [pmic_glink_altmode]

> > > For some reason, possibly due to unrelated changes in timing, possibly
> > > after the hotplug revert, I am no longer able to reproduce the reset
> > > with 6.8-rc7 on the X13s.
>
> > I do not know how the hotplug revert fixed this stack because I think
> > this can still happen.

So, while it may still be theoretically possible to hit the resets after the revert, the HPD notify revert effectively "fixed" the regression in 6.8-rc1 by removing the preconditions that made us hit it (i.e. the half-initialised bridge).
It seems the hotplug state machine needs to be reworked completely, but at least we're roughly back where we were with 6.7 (including that the bus clocks will never be turned off because of the rpm leaks on disconnect).

Johan

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-03-11 13:43           ` Johan Hovold
@ 2024-03-11 15:16             ` Johan Hovold
  2024-03-11 16:51             ` Abhinav Kumar
  1 sibling, 0 replies; 8+ messages in thread
From: Johan Hovold @ 2024-03-11 15:16 UTC (permalink / raw)
To: regressions
Cc: Abhinav Kumar, Rob Clark, Dmitry Baryshkov, Kuogee Hsieh,
	Daniel Thompson, Bjorn Andersson, dri-devel, freedreno,
	linux-arm-msm, linux-kernel

On Mon, Mar 11, 2024 at 02:43:24PM +0100, Johan Hovold wrote:
> So, while it may still be theoretically possible to hit the resets after
> the revert, the HPD notify revert effectively "fixed" the regression in
> 6.8-rc1 by removing the preconditions that made us hit it (i.e. the
> half-initialised bridge).
>
> It seems the hotplug state machine needs to be reworked completely, but
> at least we're roughly back where we were with 6.7 (including that the
> bus clocks will never be turned off because of the rpm leaks on
> disconnect).

#regzbot introduced: e467e0bde881
#regzbot fix: 664bad6af3cb

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-03-11 13:43           ` Johan Hovold
  2024-03-11 15:16             ` Johan Hovold
@ 2024-03-11 16:51             ` Abhinav Kumar
  2024-03-12  8:31               ` Johan Hovold
  1 sibling, 1 reply; 8+ messages in thread
From: Abhinav Kumar @ 2024-03-11 16:51 UTC (permalink / raw)
To: Johan Hovold
Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Daniel Thompson,
	Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On 3/11/2024 6:43 AM, Johan Hovold wrote:
> On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
>> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
>>> On 3/8/2024 4:43 AM, Johan Hovold wrote:
>>
>>> For this last remaining reset with the stacktrace you have mentioned
>>> below, I do not think this was introduced by the PM runtime series. We
>>> have had this report earlier with the same stacktrace as well on earlier
>>> kernels which didn't have PM runtime support.
>>
>>> But, it seems like there is another race condition in this code
>>> resulting in this issue again.
>>>
>>> I have posted my analysis with the patch here as an RFC yesterday:
>>>
>>> https://patchwork.freedesktop.org/patch/581758/
>>>
>>> I missed CCing you and Bjorn on the RFC but when I post it as a patch
>>> either today/tomorrow, I will CC you both.
>>
>> Ok, thanks. I'll take a closer look at that next week. It's not clear to
>> me what that race looks like after reading the commit message. It would
>> be good to have some more details in there (e.g. the sequence of events
>> that breaks the state machine) and some way to reproduce the issue
>> reliably.
>
> I was able to reproduce the reset with some of my debug printks in place
> after reapplying the reverted HPD notify change, so I now have an
> explanation for (one of) the ways we can end up in this state.
>
> This does not match the description of the problem in the fix linked to
> above and I don't think that patch solves the underlying issue even if
> it may make the race window somewhat smaller. More details below.

It's the same condition you described below: enable does not go through and we bail out as we are in ST_DISCONNECTED.

>>>> We now also have Bjorn's call trace from a different Qualcomm platform
>>>> triggering an unclocked access on disconnect, something which would
>>>> trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>>
>>>> [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
>>>> [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
>>>> [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
>>>> [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
>>>> [ 2030.379642] Call trace:
>>
>>>> [ 2030.379722] wait_for_completion_timeout+0x14/0x20
>>>> [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
>>>> [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
>>>> [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
>>>> [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
>>>> [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
>>>> [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
>>>> [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
>>>> [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
>>>> [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
>>>> [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
>>>> [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
>>>> [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
>>>> [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
>>>> [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
>>>> [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
>>>> [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
>>>> [ 2030.381241] process_scheduled_works+0x19c/0x2cc
>>>> [ 2030.381249] worker_thread+0x290/0x3cc
>>>> [ 2030.381255] kthread+0xfc/0x184
>>>> [ 2030.381260] ret_from_fork+0x10/0x20
>>>>
>>>> The above could happen if the convoluted hotplug state machine breaks
>>>> down so that the device is runtime suspended before
>>>> dp_bridge_atomic_disable() is called.
>>
>>> Yes, the state machine got broken and I have explained how in the commit
>>> text of the RFC. But it's not necessarily due to PM runtime but a
>>> sequence of events happening from userspace exposing this breakage.
>>
>> After looking at this some more today, I agree with you. The
>> observations I've reported are consistent with what would happen if the
>> link clock is disabled when dp_bridge_atomic_disable() is called.
>>
>> That clock is disabled in dp_bridge_atomic_post_disable() before runtime
>> suspending, but perhaps there are some further paths that can end up
>> disabling it.
>
> Turns out there are paths like that, and we hit those more often before
> reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()").
>
> Specifically, when a previous connect attempt did not enable the bridge
> fully, so that it is still in the ST_MAINLINK_READY state when we receive
> a disconnect event, dp_hpd_unplug_handle() will turn off the link clock.
>
> [ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
> [ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
> [ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link
>
> A racing connect event, such as the one I described earlier, can then
> try to enable the bridge again, but dp_bridge_atomic_enable() just bails
> out early (and leaks an rpm reference) because we're now in
> ST_DISCONNECTED:
>
> [ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
> [ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
> [ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
> [ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
> [ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
> [ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
> [ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)
>
> Clearing link_ready already in dp_display_notify_disconnect() would make
> the race window slightly smaller, but it would essentially just paper
> over the bug as the events are still not serialised. Notably, there is
> no user space interaction involved here and it's the spurious connect
> event that triggers the bridge enable.

Yes, it only narrows the race condition window. The issue can still happen if the commit / modeset was issued before we marked link_ready as false.

And yes, I was only targeting a short-term fix till we rework the HPD handling. That will happen only incrementally as it's a delicate piece of code.
> When the fbdev hotplug code later disables the never-enabled bridge,
> things go boom:
>
> [ 204.649072] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
> [ 204.650378] [CONNECTOR:35:DP-2] status updated from connected to disconnected
> [ 204.651111] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable
>
> as the link clock has already been disabled when accessing the link
> registers.
>
> The stack trace for the bridge enable above is:
>
> [ 204.553922] dp_bridge_atomic_enable+0x40/0x2f0 [msm]
> [ 204.555241] drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
> [ 204.556557] drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
> [ 204.557853] msm_atomic_commit_tail+0x204/0x804 [msm]
> [ 204.559173] commit_tail+0xa4/0x18c [drm_kms_helper]
> [ 204.560450] drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
> [ 204.561743] drm_atomic_commit+0xa4/0xdc [drm]
> [ 204.563065] drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
> [ 204.564402] drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
> [ 204.565733] drm_client_modeset_commit+0x30/0x58 [drm]
> [ 204.567055] __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
> [ 204.568381] drm_fb_helper_hotplug_event.part.0+0xd4/0x110 [drm_kms_helper]
> [ 204.569708] drm_fb_helper_hotplug_event+0x38/0x44 [drm_kms_helper]
> [ 204.571032] msm_fbdev_client_hotplug+0x28/0xd4 [msm]
> [ 204.572395] drm_client_dev_hotplug+0xcc/0x130 [drm]
> [ 204.573755] drm_kms_helper_connector_hotplug_event+0x34/0x44 [drm_kms_helper]
> [ 204.575114] drm_bridge_connector_hpd_cb+0x90/0xa4 [drm_kms_helper]
> [ 204.576465] drm_bridge_hpd_notify+0x40/0x5c [drm]
> [ 204.577842] drm_aux_hpd_bridge_notify+0x18/0x28 [aux_hpd_bridge]
> [ 204.579184] pmic_glink_altmode_worker+0xc0/0x23c [pmic_glink_altmode]
>
>>>> For some reason, possibly due to unrelated changes in timing, possibly
>>>> after the hotplug revert, I am no longer able to reproduce the reset
>>>> with 6.8-rc7 on the X13s.
>>
>>> I do not know how the hotplug revert fixed this stack because I think
>>> this can still happen.
>
> So, while it may still be theoretically possible to hit the resets after
> the revert, the HPD notify revert effectively "fixed" the regression in
> 6.8-rc1 by removing the preconditions that made us hit it (i.e. the
> half-initialised bridge).

Not entirely. The bug which was reported before the patch e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()") landed is a classic example of how this issue can happen with userspace involvement, and not just the fbdev client, which was your case.

> It seems the hotplug state machine needs to be reworked completely, but
> at least we're roughly back where we were with 6.7 (including that the
> bus clocks will never be turned off because of the rpm leaks on
> disconnect).

Yes, we already landed that revert, but I am also planning to land the patch I had posted till we rework the HPD handling.

> Johan

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
  2024-03-11 16:51             ` Abhinav Kumar
@ 2024-03-12  8:31               ` Johan Hovold
  0 siblings, 0 replies; 8+ messages in thread
From: Johan Hovold @ 2024-03-12 8:31 UTC (permalink / raw)
To: Abhinav Kumar
Cc: Rob Clark, Dmitry Baryshkov, Kuogee Hsieh, Daniel Thompson,
	Sean Paul, Marijn Suijten, David Airlie, Daniel Vetter,
	Bjorn Andersson, quic_jesszhan, quic_sbillaka, dri-devel,
	freedreno, linux-arm-msm, regressions, linux-kernel

On Mon, Mar 11, 2024 at 09:51:29AM -0700, Abhinav Kumar wrote:
> On 3/11/2024 6:43 AM, Johan Hovold wrote:
> > On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
> >> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> >>> I have posted my analysis with the patch here as an RFC yesterday:
> >>>
> >>> https://patchwork.freedesktop.org/patch/581758/

> > I was able to reproduce the reset with some of my debug printks in place
> > after reapplying the reverted HPD notify change, so I now have an
> > explanation for (one of) the ways we can end up in this state.
> >
> > This does not match the description of the problem in the fix linked to
> > above and I don't think that patch solves the underlying issue even if
> > it may make the race window somewhat smaller. More details below.

> It's the same condition you described below: enable does not go
> through and we bail out as we are in ST_DISCONNECTED.

It's closely related, but clearly not the same, as user space is not involved in the reset I see.

> > Turns out there are paths like that, and we hit those more often before
> > reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()").
> >
> > Specifically, when a previous connect attempt did not enable the bridge
> > fully, so that it is still in the ST_MAINLINK_READY state when we receive
> > a disconnect event, dp_hpd_unplug_handle() will turn off the link clock.
> > > > [ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2 > > [ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle > > [ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link > > > > A racing connect event, such as the one I described earlier, can then > > try to enable the bridge again but dp_bridge_atomic_enable() just bails > > out early (and leaks a rpm reference) because we're now in > > ST_DISCONNECTED: > > > > [ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1 > > [ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected > > [ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0 > > [ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1 > > [ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1 > > [ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable > > [ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?) > > > > Clearing link_ready already in dp_display_notify_disconnect() would make > > the race window slightly smaller, but it would essentially just paper > > over the bug as the events are still not serialised. Notably, there is > > no user space interaction involved here and it's the spurious connect > > event that triggers the bridge enable. > Yes, it only narrows down the race condition window. The issue can still > happen if the commit / modeset was issued before we marked link_ready as > false. > > And yes, I was only targetting a short term fix till we rework the HPD. > That will happen only incrementally as its a delicate piece of code. Ok, thanks for confirming. 
Please also make that clear in the commit message of any patch.

I am, however, not sure that your patch (RFC) is needed at this point, as the HPD revert fixes the 6.8-rc1 regression, and moving the clearing of link_ready can actually make things worse as it makes any spurious hotplug event always be processed (not just if it happens after dp_display_send_hpd_notification()). I'll reply to your patch as well.

> > So, while it may still be theoretically possible to hit the resets after
> > the revert, the HPD notify revert effectively "fixed" the regression in
> > 6.8-rc1 by removing the preconditions that made us hit it (i.e. the
> > half-initialised bridge).

> Not entirely. The bug which was reported before the patch
> e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()") landed is a
> classic example of how this issue can happen with userspace involvement,
> and not just the fbdev client, which was your case.

Sure, but you already added some kind of band-aid for that issue, right?

https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khsieh@quicinc.com/

How likely is it that you'd still hit that? Have you had any reports of anyone actually hitting that issue since the above workaround was merged?

Note that I, and other users of the X13s, only started hitting the resets with 6.8-rc1, which broke hotplug notification and resulted in the half-initialised bridges. I'm not saying it's impossible to still hit the unclocked access, just that that does not seem to be relevant for this regression.

> > It seems the hotplug state machine needs to be reworked completely, but
> > at least we're roughly back where we were with 6.7 (including that the
> > bus clocks will never be turned off because of the rpm leaks on
> > disconnect).

> Yes, we already landed that revert, but I am also planning to land the
> patch I had posted till we rework the HPD handling.

Ok, but please consider the implications that would have for any spurious connect events first.
Johan

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads: [~2024-03-12 8:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-27 13:33 drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1 Johan Hovold
2024-03-08 12:43 ` Johan Hovold
2024-03-08 17:50   ` Abhinav Kumar
2024-03-09 16:30     ` Johan Hovold
2024-03-11 13:43       ` Johan Hovold
2024-03-11 15:16         ` Johan Hovold
2024-03-11 16:51         ` Abhinav Kumar
2024-03-12  8:31           ` Johan Hovold
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox