* GPU lockup+blackout on 2.6.35 with r600... @ 2010-08-05 10:52 Daniel J Blueman 2010-08-05 13:50 ` Jerome Glisse 0 siblings, 1 reply; 8+ messages in thread From: Daniel J Blueman @ 2010-08-05 10:52 UTC (permalink / raw) To: dri-devel After around 10-30 mins of typical usage (ie cairo rendering to surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon locks up, giving a blank screen. The only information I have is a backtrace [1]. 'radeontool regs' doesn't give anything - what information, /sys files etc would be useful here, and would it be useful to log a freedesktop.org bug report at this stage, assuming I haven't found a reliable reproducer? OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. Thanks, Daniel --- [1] radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec ------------[ cut here ]------------ WARNING: at /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 radeon_fence_wait+0x2b7/0x320 [radeon]() Hardware name: Studio 1557 GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c libcrc32c Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 Call Trace: ? radeon_fence_wait+0x2b7/0x320 [radeon] warn_slowpath_common+0x90/0xc0 warn_slowpath_fmt+0x6e/0x70 ? schedule_timeout+0x15a/0x2e0 ? r600_irq_set+0x27d/0xc00 [radeon] ? radeon_ring_commit+0xa3/0xb0 [radeon] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] radeon_fence_wait+0x2b7/0x320 [radeon] ? autoremove_wake_function+0x0/0x40 radeon_sync_obj_wait+0x11/0x20 [radeon] ttm_bo_wait+0x102/0x1b0 [ttm] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] radeon_move_blit+0x124/0x170 [radeon] radeon_bo_move+0xda/0x1a0 [radeon] ttm_bo_handle_move_mem+0xec/0x370 [ttm] ttm_bo_evict+0x1cc/0x270 [ttm] ? drm_mm_split_at_start+0x1d/0x80 [drm] ttm_mem_evict_first+0xed/0x180 [ttm] ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] ttm_bo_mem_force_space+0xab/0x110 [ttm] ttm_bo_mem_space+0x305/0x370 [ttm] ttm_bo_move_buffer+0xcf/0x140 [ttm] ? drm_mm_split_at_start+0x1d/0x80 [drm] ttm_bo_validate+0xd5/0x100 [ttm] ttm_bo_init+0x1f2/0x240 [ttm] radeon_bo_create+0x121/0x240 [radeon] ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] radeon_gem_object_create+0x89/0x100 [radeon] ? drm_gem_handle_create+0x8b/0xa0 [drm] radeon_gem_create_ioctl+0x58/0xe0 [radeon] drm_ioctl+0x283/0x460 [drm] ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] ? do_readv_writev+0x16b/0x1e0 vfs_ioctl+0x3a/0xc0 do_vfs_ioctl+0x6d/0x1f0 sys_ioctl+0x87/0xa0 ? sys_writev+0x55/0xb0 system_call_fastpath+0x16/0x1b ---[ end trace c0bc12025fa8386c ]--- radeon 0000:02:00.0: GPU softreset radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 radeon 0000:02:00.0: GPU reset succeed Clocks initialized ! ring test succeeded in 1 usecs ib test succeeded in 1 usecs -- Daniel J Blueman ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-05 10:52 GPU lockup+blackout on 2.6.35 with r600 Daniel J Blueman @ 2010-08-05 13:50 ` Jerome Glisse 2010-08-05 15:23 ` Alex Deucher 0 siblings, 1 reply; 8+ messages in thread From: Jerome Glisse @ 2010-08-05 13:50 UTC (permalink / raw) To: Daniel J Blueman; +Cc: dri-devel On 08/05/2010 06:52 AM, Daniel J Blueman wrote: > After around 10-30 mins of typical usage (ie cairo rendering to > surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon > locks up, giving a blank screen. > > The only information I have is a backtrace [1]. 'radeontool regs' > doesn't give anything - what information, /sys files etc would be > useful here, and would it be useful to log a freedesktop.org bug > report at this stage, assuming I haven't found a reliable reproducer? > > OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. > > Thanks, > Daniel > > --- [1] > > radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec > ------------[ cut here ]------------ > WARNING: at /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 > radeon_fence_wait+0x2b7/0x320 [radeon]() > Hardware name: Studio 1557 > GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) > Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm > microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp > ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables > snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp > nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm > snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack > iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video > snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp > dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 > i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core > i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci > sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c > libcrc32c > Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 > Call Trace: > ? radeon_fence_wait+0x2b7/0x320 [radeon] > warn_slowpath_common+0x90/0xc0 > warn_slowpath_fmt+0x6e/0x70 > ? schedule_timeout+0x15a/0x2e0 > ? r600_irq_set+0x27d/0xc00 [radeon] > ? radeon_ring_commit+0xa3/0xb0 [radeon] > ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] > ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] > radeon_fence_wait+0x2b7/0x320 [radeon] > ? autoremove_wake_function+0x0/0x40 > radeon_sync_obj_wait+0x11/0x20 [radeon] > ttm_bo_wait+0x102/0x1b0 [ttm] > ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] > radeon_move_blit+0x124/0x170 [radeon] > radeon_bo_move+0xda/0x1a0 [radeon] > ttm_bo_handle_move_mem+0xec/0x370 [ttm] > ttm_bo_evict+0x1cc/0x270 [ttm] > ? drm_mm_split_at_start+0x1d/0x80 [drm] > ttm_mem_evict_first+0xed/0x180 [ttm] > ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] > ttm_bo_mem_force_space+0xab/0x110 [ttm] > ttm_bo_mem_space+0x305/0x370 [ttm] > ttm_bo_move_buffer+0xcf/0x140 [ttm] > ? drm_mm_split_at_start+0x1d/0x80 [drm] > ttm_bo_validate+0xd5/0x100 [ttm] > ttm_bo_init+0x1f2/0x240 [ttm] > radeon_bo_create+0x121/0x240 [radeon] > ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] > radeon_gem_object_create+0x89/0x100 [radeon] > ? drm_gem_handle_create+0x8b/0xa0 [drm] > radeon_gem_create_ioctl+0x58/0xe0 [radeon] > drm_ioctl+0x283/0x460 [drm] > ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] > ? do_readv_writev+0x16b/0x1e0 > vfs_ioctl+0x3a/0xc0 > do_vfs_ioctl+0x6d/0x1f0 > sys_ioctl+0x87/0xa0 > ? sys_writev+0x55/0xb0 > system_call_fastpath+0x16/0x1b > ---[ end trace c0bc12025fa8386c ]--- > radeon 0000:02:00.0: GPU softreset > radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 > radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 > radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 > radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE > radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 > radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 > radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 > radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 > radeon 0000:02:00.0: GPU reset succeed > Clocks initialized ! > ring test succeeded in 1 usecs > ib test succeeded in 1 usecs SO it keep reseting ? According to log there is a GPU lockup but then a successfull GPU reset so GPU should resume fine (that's what log says) Best is to open a bug and attach full dmesg + lspci -vv and context in which the lockup happen Cheers, Jerome ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-05 13:50 ` Jerome Glisse @ 2010-08-05 15:23 ` Alex Deucher 2010-08-08 16:36 ` Daniel J Blueman 0 siblings, 1 reply; 8+ messages in thread From: Alex Deucher @ 2010-08-05 15:23 UTC (permalink / raw) To: Jerome Glisse; +Cc: Daniel J Blueman, dri-devel On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote: > On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >> >> After around 10-30 mins of typical usage (ie cairo rendering to >> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >> locks up, giving a blank screen. >> >> The only information I have is a backtrace [1]. 'radeontool regs' >> doesn't give anything - what information, /sys files etc would be >> useful here, and would it be useful to log a freedesktop.org bug >> report at this stage, assuming I haven't found a reliable reproducer? >> >> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >> >> Thanks, >> Daniel >> >> --- [1] >> >> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >> ------------[ cut here ]------------ >> WARNING: at >> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >> radeon_fence_wait+0x2b7/0x320 [radeon]() >> Hardware name: Studio 1557 >> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >> libcrc32c >> Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 >> Call Trace: >> ? radeon_fence_wait+0x2b7/0x320 [radeon] >> warn_slowpath_common+0x90/0xc0 >> warn_slowpath_fmt+0x6e/0x70 >> ? schedule_timeout+0x15a/0x2e0 >> ? r600_irq_set+0x27d/0xc00 [radeon] >> ? radeon_ring_commit+0xa3/0xb0 [radeon] >> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >> radeon_fence_wait+0x2b7/0x320 [radeon] >> ? autoremove_wake_function+0x0/0x40 >> radeon_sync_obj_wait+0x11/0x20 [radeon] >> ttm_bo_wait+0x102/0x1b0 [ttm] >> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >> radeon_move_blit+0x124/0x170 [radeon] >> radeon_bo_move+0xda/0x1a0 [radeon] >> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >> ttm_bo_evict+0x1cc/0x270 [ttm] >> ? drm_mm_split_at_start+0x1d/0x80 [drm] >> ttm_mem_evict_first+0xed/0x180 [ttm] >> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >> ttm_bo_mem_force_space+0xab/0x110 [ttm] >> ttm_bo_mem_space+0x305/0x370 [ttm] >> ttm_bo_move_buffer+0xcf/0x140 [ttm] >> ? drm_mm_split_at_start+0x1d/0x80 [drm] >> ttm_bo_validate+0xd5/0x100 [ttm] >> ttm_bo_init+0x1f2/0x240 [ttm] >> radeon_bo_create+0x121/0x240 [radeon] >> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >> radeon_gem_object_create+0x89/0x100 [radeon] >> ? drm_gem_handle_create+0x8b/0xa0 [drm] >> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >> drm_ioctl+0x283/0x460 [drm] >> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >> ? do_readv_writev+0x16b/0x1e0 >> vfs_ioctl+0x3a/0xc0 >> do_vfs_ioctl+0x6d/0x1f0 >> sys_ioctl+0x87/0xa0 >> ? sys_writev+0x55/0xb0 >> system_call_fastpath+0x16/0x1b >> ---[ end trace c0bc12025fa8386c ]--- >> radeon 0000:02:00.0: GPU softreset >> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 >> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE >> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 >> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >> radeon 0000:02:00.0: GPU reset succeed >> Clocks initialized ! >> ring test succeeded in 1 usecs >> ib test succeeded in 1 usecs > > > SO it keep reseting ? According to log there is a GPU lockup but then > a successfull GPU reset so GPU should resume fine (that's what log says) > Best is to open a bug and attach full dmesg + lspci -vv and context > in which the lockup happen > Perhaps the lockup timeout is too short or interrupts are being delivered late or something? There's a fdo or kernel bugzilla entry where removing the lockup timeout fixed the issue. Does something like this: --- a/drivers/gpu/drm/radeon/radeon_fence.c +++ b/drivers/gpu/drm/radeon/radeon_fence.c @@ -237,9 +237,11 @@ retry: * as signaled for now */ rdev->gpu_lockup = true; +#if 0 r = radeon_gpu_reset(rdev); if (r) return r; +#endif WREG32(rdev->fence_drv.scratch_reg, fence->seq); rdev->gpu_lockup = false; } fix the issue? Alex ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-05 15:23 ` Alex Deucher @ 2010-08-08 16:36 ` Daniel J Blueman 2010-08-09 13:41 ` Daniel J Blueman 0 siblings, 1 reply; 8+ messages in thread From: Daniel J Blueman @ 2010-08-08 16:36 UTC (permalink / raw) To: Alex Deucher, Jerome Glisse; +Cc: dri-devel On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote: > On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote: >> On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >>> >>> After around 10-30 mins of typical usage (ie cairo rendering to >>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >>> locks up, giving a blank screen. >>> >>> The only information I have is a backtrace [1]. 'radeontool regs' >>> doesn't give anything - what information, /sys files etc would be >>> useful here, and would it be useful to log a freedesktop.org bug >>> report at this stage, assuming I haven't found a reliable reproducer? >>> >>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >>> >>> Thanks, >>> Daniel >>> >>> --- [1] >>> >>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >>> ------------[ cut here ]------------ >>> WARNING: at >>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >>> radeon_fence_wait+0x2b7/0x320 [radeon]() >>> Hardware name: Studio 1557 >>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >>> libcrc32c >>> Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 >>> Call Trace: >>> ? radeon_fence_wait+0x2b7/0x320 [radeon] >>> warn_slowpath_common+0x90/0xc0 >>> warn_slowpath_fmt+0x6e/0x70 >>> ? schedule_timeout+0x15a/0x2e0 >>> ? r600_irq_set+0x27d/0xc00 [radeon] >>> ? radeon_ring_commit+0xa3/0xb0 [radeon] >>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>> radeon_fence_wait+0x2b7/0x320 [radeon] >>> ? autoremove_wake_function+0x0/0x40 >>> radeon_sync_obj_wait+0x11/0x20 [radeon] >>> ttm_bo_wait+0x102/0x1b0 [ttm] >>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>> radeon_move_blit+0x124/0x170 [radeon] >>> radeon_bo_move+0xda/0x1a0 [radeon] >>> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >>> ttm_bo_evict+0x1cc/0x270 [ttm] >>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>> ttm_mem_evict_first+0xed/0x180 [ttm] >>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >>> ttm_bo_mem_force_space+0xab/0x110 [ttm] >>> ttm_bo_mem_space+0x305/0x370 [ttm] >>> ttm_bo_move_buffer+0xcf/0x140 [ttm] >>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>> ttm_bo_validate+0xd5/0x100 [ttm] >>> ttm_bo_init+0x1f2/0x240 [ttm] >>> radeon_bo_create+0x121/0x240 [radeon] >>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >>> radeon_gem_object_create+0x89/0x100 [radeon] >>> ? drm_gem_handle_create+0x8b/0xa0 [drm] >>> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >>> drm_ioctl+0x283/0x460 [drm] >>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >>> ? do_readv_writev+0x16b/0x1e0 >>> vfs_ioctl+0x3a/0xc0 >>> do_vfs_ioctl+0x6d/0x1f0 >>> sys_ioctl+0x87/0xa0 >>> ? sys_writev+0x55/0xb0 >>> system_call_fastpath+0x16/0x1b >>> ---[ end trace c0bc12025fa8386c ]--- >>> radeon 0000:02:00.0: GPU softreset >>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 >>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE >>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 >>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>> radeon 0000:02:00.0: GPU reset succeed >>> Clocks initialized ! >>> ring test succeeded in 1 usecs >>> ib test succeeded in 1 usecs >> >> SO it keep reseting ? According to log there is a GPU lockup but then >> a successfull GPU reset so GPU should resume fine (that's what log says) >> Best is to open a bug and attach full dmesg + lspci -vv and context >> in which the lockup happen > > Perhaps the lockup timeout is too short or interrupts are being > delivered late or something? There's a fdo or kernel bugzilla entry > where removing the lockup timeout fixed the issue. Does something > like this: > > --- a/drivers/gpu/drm/radeon/radeon_fence.c > +++ b/drivers/gpu/drm/radeon/radeon_fence.c > @@ -237,9 +237,11 @@ retry: > * as signaled for now > */ > rdev->gpu_lockup = true; > +#if 0 > r = radeon_gpu_reset(rdev); > if (r) > return r; > +#endif > WREG32(rdev->fence_drv.scratch_reg, fence->seq); > rdev->gpu_lockup = false; > } > > fix the issue? Correction, RV710. Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already excruciatingly long for the command processor to not have responded, but the issue I'm really hitting here is that X if left hanging after GPU reset. The GPU lockup was being reported at: [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] X consistently hangs after GPU reset at (with sysrq-T): [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 [<ffffffff8157647b>] mutex_lock+0x2b/0x50 [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] I couldn't reproduce it (but did hit it a number of times over the last two weeks), so have been analysing with what we have. The kernel the backtrace is against isn't built with debug, so: radeon_pm_set_clocks(): 8ca80: 55 push %rbp 8ca81: 48 89 e5 mov %rsp,%rbp 8ca84: 41 57 push %r15 8ca86: 41 56 push %r14 8ca88: 41 55 push %r13 8ca8a: 41 54 push %r12 8ca8c: 53 push %rbx 8ca8d: 48 83 ec 48 sub $0x48,%rsp 8ca91: e8 00 00 00 00 callq 8ca96 <radeon_pm_set_clocks+0x16> 8ca96: 49 89 fc mov %rdi,%r12 8ca99: 48 8b 7f 08 mov 0x8(%rdi),%rdi 8ca9d: 48 83 c7 20 add $0x20,%rdi 8caa1: e8 00 00 00 00 callq 8caa6 <radeon_pm_set_clocks+0x26> 8caa6: 49 8d 84 24 a0 1c 00 lea 0x1ca0(%r12),%rax 8caad: 00 8caae: 48 89 c7 mov %rax,%rdi 8cab1: 48 89 45 90 mov %rax,-0x70(%rbp) 8cab5: e8 00 00 00 00 callq 8caba <radeon_pm_set_clocks+0x3a> -> 8caba: 49 8d 8c 24 e8 09 00 lea 0x9e8(%r12),%rcx 8cac1: 00 8cac2: 48 89 cf mov %rcx,%rdi 8cac5: 48 89 4d 98 mov %rcx,-0x68(%rbp) 8cac9: e8 00 00 00 00 callq 8cace <radeon_pm_set_clocks+0x4e> 8cace: 41 83 7c 24 78 1a cmpl $0x1a,0x78(%r12) 8cad4: 0f 86 46 01 00 00 jbe 8cc20 <radeon_pm_set_clocks+0x1a0> 8cada: 41 80 bc 24 d0 0d 00 cmpb $0x0,0xdd0(%r12) 8cae1: 00 00 This corresponds to: static void radeon_pm_set_clocks(struct radeon_device *rdev) { int i; mutex_lock(&rdev->ddev->struct_mutex); mutex_lock(&rdev->vram_mutex); -> mutex_lock(&rdev->cp.mutex); We know r12 is the rdev struct pointer (%rdi is the first function argument gets loaded into r12), and 0x9e8 is plausable as the offset of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so the code is trying to acquire cp.mutex it seems. >From the kernel log, the command proc ring test completed after the start of the GPU reset, which is the last callsite to unlock cp.mutex, ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> before the hang at <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>. Looking through all the calls between these two spots, I just can't see anything that _could_ have locked cp.mutex, unless I'm misreading the disassembly of mutex_lock. Any thoughts? Dan -- Daniel J Blueman ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-08 16:36 ` Daniel J Blueman @ 2010-08-09 13:41 ` Daniel J Blueman 2010-08-17 22:12 ` Daniel J Blueman 0 siblings, 1 reply; 8+ messages in thread From: Daniel J Blueman @ 2010-08-09 13:41 UTC (permalink / raw) To: Alex Deucher, Jerome Glisse, Matthew Garrett; +Cc: dri-devel On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote: > On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote: >> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote: >>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >>>> >>>> After around 10-30 mins of typical usage (ie cairo rendering to >>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >>>> locks up, giving a blank screen. >>>> >>>> The only information I have is a backtrace [1]. 'radeontool regs' >>>> doesn't give anything - what information, /sys files etc would be >>>> useful here, and would it be useful to log a freedesktop.org bug >>>> report at this stage, assuming I haven't found a reliable reproducer? >>>> >>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >>>> >>>> Thanks, >>>> Daniel >>>> >>>> --- [1] >>>> >>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >>>> ------------[ cut here ]------------ >>>> WARNING: at >>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >>>> radeon_fence_wait+0x2b7/0x320 [radeon]() >>>> Hardware name: Studio 1557 >>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >>>> libcrc32c >>>> Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 >>>> Call Trace: >>>> ? radeon_fence_wait+0x2b7/0x320 [radeon] >>>> warn_slowpath_common+0x90/0xc0 >>>> warn_slowpath_fmt+0x6e/0x70 >>>> ? schedule_timeout+0x15a/0x2e0 >>>> ? r600_irq_set+0x27d/0xc00 [radeon] >>>> ? radeon_ring_commit+0xa3/0xb0 [radeon] >>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>>> radeon_fence_wait+0x2b7/0x320 [radeon] >>>> ? autoremove_wake_function+0x0/0x40 >>>> radeon_sync_obj_wait+0x11/0x20 [radeon] >>>> ttm_bo_wait+0x102/0x1b0 [ttm] >>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>> radeon_move_blit+0x124/0x170 [radeon] >>>> radeon_bo_move+0xda/0x1a0 [radeon] >>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >>>> ttm_bo_evict+0x1cc/0x270 [ttm] >>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>> ttm_mem_evict_first+0xed/0x180 [ttm] >>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >>>> ttm_bo_mem_force_space+0xab/0x110 [ttm] >>>> ttm_bo_mem_space+0x305/0x370 [ttm] >>>> ttm_bo_move_buffer+0xcf/0x140 [ttm] >>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>> ttm_bo_validate+0xd5/0x100 [ttm] >>>> ttm_bo_init+0x1f2/0x240 [ttm] >>>> radeon_bo_create+0x121/0x240 [radeon] >>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >>>> radeon_gem_object_create+0x89/0x100 [radeon] >>>> ? drm_gem_handle_create+0x8b/0xa0 [drm] >>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >>>> drm_ioctl+0x283/0x460 [drm] >>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >>>> ? do_readv_writev+0x16b/0x1e0 >>>> vfs_ioctl+0x3a/0xc0 >>>> do_vfs_ioctl+0x6d/0x1f0 >>>> sys_ioctl+0x87/0xa0 >>>> ? sys_writev+0x55/0xb0 >>>> system_call_fastpath+0x16/0x1b >>>> ---[ end trace c0bc12025fa8386c ]--- >>>> radeon 0000:02:00.0: GPU softreset >>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 >>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE >>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 >>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>> radeon 0000:02:00.0: GPU reset succeed >>>> Clocks initialized ! >>>> ring test succeeded in 1 usecs >>>> ib test succeeded in 1 usecs >>> >>> SO it keep reseting ? According to log there is a GPU lockup but then >>> a successfull GPU reset so GPU should resume fine (that's what log says) >>> Best is to open a bug and attach full dmesg + lspci -vv and context >>> in which the lockup happen >> >> Perhaps the lockup timeout is too short or interrupts are being >> delivered late or something? There's a fdo or kernel bugzilla entry >> where removing the lockup timeout fixed the issue. Does something >> like this: >> >> --- a/drivers/gpu/drm/radeon/radeon_fence.c >> +++ b/drivers/gpu/drm/radeon/radeon_fence.c >> @@ -237,9 +237,11 @@ retry: >> * as signaled for now >> */ >> rdev->gpu_lockup = true; >> +#if 0 >> r = radeon_gpu_reset(rdev); >> if (r) >> return r; >> +#endif >> WREG32(rdev->fence_drv.scratch_reg, fence->seq); >> rdev->gpu_lockup = false; >> } >> >> fix the issue? > > Correction, RV710. > > Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already > excruciatingly long for the command processor to not have responded, > but the issue I'm really hitting here is that X if left hanging after > GPU reset. The GPU lockup was being reported at: > > [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] > [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 > [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 > [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 > [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] > [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] > [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] > [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] > [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] > [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 > [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] > [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] > [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] > [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] > > X consistently hangs after GPU reset at (with sysrq-T): > > [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 > [<ffffffff8157647b>] mutex_lock+0x2b/0x50 > [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] > [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 > [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] > [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 > [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] > [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] > [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] > [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] > [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] > [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] > [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 > [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] > [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] > [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] > [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] > > I couldn't reproduce it (but did hit it a number of times over the > last two weeks), so have been analysing with what we have. The kernel > the backtrace is against isn't built with debug, so: > > radeon_pm_set_clocks(): > 8ca80: 55 push %rbp > 8ca81: 48 89 e5 mov %rsp,%rbp > 8ca84: 41 57 push %r15 > 8ca86: 41 56 push %r14 > 8ca88: 41 55 push %r13 > 8ca8a: 41 54 push %r12 > 8ca8c: 53 push %rbx > 8ca8d: 48 83 ec 48 sub $0x48,%rsp > 8ca91: e8 00 00 00 00 callq 8ca96 <radeon_pm_set_clocks+0x16> > 8ca96: 49 89 fc mov %rdi,%r12 > 8ca99: 48 8b 7f 08 mov 0x8(%rdi),%rdi > 8ca9d: 48 83 c7 20 add $0x20,%rdi > 8caa1: e8 00 00 00 00 callq 8caa6 <radeon_pm_set_clocks+0x26> > 8caa6: 49 8d 84 24 a0 1c 00 lea 0x1ca0(%r12),%rax > 8caad: 00 > 8caae: 48 89 c7 mov %rax,%rdi > 8cab1: 48 89 45 90 mov %rax,-0x70(%rbp) > 8cab5: e8 00 00 00 00 callq 8caba <radeon_pm_set_clocks+0x3a> > -> 8caba: 49 8d 8c 24 e8 09 00 lea 0x9e8(%r12),%rcx > 8cac1: 00 > 8cac2: 48 89 cf mov %rcx,%rdi > 8cac5: 48 89 4d 98 mov %rcx,-0x68(%rbp) > 8cac9: e8 00 00 00 00 callq 8cace <radeon_pm_set_clocks+0x4e> > 8cace: 41 83 7c 24 78 1a cmpl $0x1a,0x78(%r12) > 8cad4: 0f 86 46 01 00 00 jbe 8cc20 > <radeon_pm_set_clocks+0x1a0> > 8cada: 41 80 bc 24 d0 0d 00 cmpb $0x0,0xdd0(%r12) > 8cae1: 00 00 > > This corresponds to: > > static void radeon_pm_set_clocks(struct radeon_device *rdev) > { > int i; > > mutex_lock(&rdev->ddev->struct_mutex); > mutex_lock(&rdev->vram_mutex); > -> mutex_lock(&rdev->cp.mutex); > > We know r12 is the rdev struct pointer (%rdi is the first function > argument gets loaded into r12), and 0x9e8 is plausable as the offset > of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so > the code is trying to acquire cp.mutex it seems. > > From the kernel log, the command proc ring test completed after the > start of the GPU reset, which is the last callsite to unlock cp.mutex, > ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> > before the hang at > <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>. > > Looking through all the calls between these two spots, I just can't > see anything that _could_ have locked cp.mutex, unless I'm misreading > the disassembly of mutex_lock. > > Any thoughts? Well, looking further back down the call chain, radeon_bo_create takes the vram_mutex across the ttm_bo_init call, so I was duped by the compiler's instruction scheduling. I looked at various changes to the locking scheme which did address the X hang (thus blackouts), but all attempts failed with lockdep violations somewhere, so there is no minimal fix without revisiting recent locking scheme changes [1,2,3]. I'll pass on some other minor fixes I've cooked up in the meantime. Note that I am running with the kernel boot options 'dynpm=1 radeon.dynclks=1' to stop everything getting toasty; let me know if you need more info for this issue. Thanks, Daniel [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917 [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923 [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937 -- Daniel J Blueman ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-09 13:41 ` Daniel J Blueman @ 2010-08-17 22:12 ` Daniel J Blueman 2010-08-17 22:37 ` Alex Deucher 0 siblings, 1 reply; 8+ messages in thread From: Daniel J Blueman @ 2010-08-17 22:12 UTC (permalink / raw) To: Alex Deucher, Jerome Glisse, Matthew Garrett; +Cc: dri-devel Hi chaps, On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman@gmail.com> wrote: > On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote: >> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote: >>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote: >>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >>>>> >>>>> After around 10-30 mins of typical usage (ie cairo rendering to >>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >>>>> locks up, giving a blank screen. >>>>> >>>>> The only information I have is a backtrace [1]. 'radeontool regs' >>>>> doesn't give anything - what information, /sys files etc would be >>>>> useful here, and would it be useful to log a freedesktop.org bug >>>>> report at this stage, assuming I haven't found a reliable reproducer? >>>>> >>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >>>>> >>>>> Thanks, >>>>> Daniel >>>>> >>>>> --- [1] >>>>> >>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >>>>> ------------[ cut here ]------------ >>>>> WARNING: at >>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >>>>> radeon_fence_wait+0x2b7/0x320 [radeon]() >>>>> Hardware name: Studio 1557 >>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >>>>> libcrc32c >>>>> Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 >>>>> Call Trace: >>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon] >>>>> warn_slowpath_common+0x90/0xc0 >>>>> warn_slowpath_fmt+0x6e/0x70 >>>>> ? schedule_timeout+0x15a/0x2e0 >>>>> ? r600_irq_set+0x27d/0xc00 [radeon] >>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon] >>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>>>> radeon_fence_wait+0x2b7/0x320 [radeon] >>>>> ? autoremove_wake_function+0x0/0x40 >>>>> radeon_sync_obj_wait+0x11/0x20 [radeon] >>>>> ttm_bo_wait+0x102/0x1b0 [ttm] >>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>>> radeon_move_blit+0x124/0x170 [radeon] >>>>> radeon_bo_move+0xda/0x1a0 [radeon] >>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >>>>> ttm_bo_evict+0x1cc/0x270 [ttm] >>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>> ttm_mem_evict_first+0xed/0x180 [ttm] >>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm] >>>>> ttm_bo_mem_space+0x305/0x370 [ttm] >>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm] >>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>> ttm_bo_validate+0xd5/0x100 [ttm] >>>>> ttm_bo_init+0x1f2/0x240 [ttm] >>>>> radeon_bo_create+0x121/0x240 [radeon] >>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >>>>> radeon_gem_object_create+0x89/0x100 [radeon] >>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm] >>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >>>>> drm_ioctl+0x283/0x460 [drm] >>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >>>>> ? do_readv_writev+0x16b/0x1e0 >>>>> vfs_ioctl+0x3a/0xc0 >>>>> do_vfs_ioctl+0x6d/0x1f0 >>>>> sys_ioctl+0x87/0xa0 >>>>> ? sys_writev+0x55/0xb0 >>>>> system_call_fastpath+0x16/0x1b >>>>> ---[ end trace c0bc12025fa8386c ]--- >>>>> radeon 0000:02:00.0: GPU softreset >>>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 >>>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE >>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >>>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 >>>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>>> radeon 0000:02:00.0: GPU reset succeed >>>>> Clocks initialized ! >>>>> ring test succeeded in 1 usecs >>>>> ib test succeeded in 1 usecs >>>> >>>> SO it keep reseting ? According to log there is a GPU lockup but then >>>> a successfull GPU reset so GPU should resume fine (that's what log says) >>>> Best is to open a bug and attach full dmesg + lspci -vv and context >>>> in which the lockup happen >>> >>> Perhaps the lockup timeout is too short or interrupts are being >>> delivered late or something? There's a fdo or kernel bugzilla entry >>> where removing the lockup timeout fixed the issue. Does something >>> like this: >>> >>> --- a/drivers/gpu/drm/radeon/radeon_fence.c >>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c >>> @@ -237,9 +237,11 @@ retry: >>> * as signaled for now >>> */ >>> rdev->gpu_lockup = true; >>> +#if 0 >>> r = radeon_gpu_reset(rdev); >>> if (r) >>> return r; >>> +#endif >>> WREG32(rdev->fence_drv.scratch_reg, fence->seq); >>> rdev->gpu_lockup = false; >>> } >>> >>> fix the issue? >> >> Correction, RV710. >> >> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already >> excruciatingly long for the command processor to not have responded, >> but the issue I'm really hitting here is that X if left hanging after >> GPU reset. The GPU lockup was being reported at: >> >> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] >> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 >> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 >> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 >> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] >> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] >> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] >> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >> >> X consistently hangs after GPU reset at (with sysrq-T): >> >> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 >> [<ffffffff8157647b>] mutex_lock+0x2b/0x50 >> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] >> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 >> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] >> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 >> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] >> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] >> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] >> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] >> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] >> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] >> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >> >> I couldn't reproduce it (but did hit it a number of times over the >> last two weeks), so have been analysing with what we have. The kernel >> the backtrace is against isn't built with debug, so: >> >> radeon_pm_set_clocks(): >> 8ca80: 55 push %rbp >> 8ca81: 48 89 e5 mov %rsp,%rbp >> 8ca84: 41 57 push %r15 >> 8ca86: 41 56 push %r14 >> 8ca88: 41 55 push %r13 >> 8ca8a: 41 54 push %r12 >> 8ca8c: 53 push %rbx >> 8ca8d: 48 83 ec 48 sub $0x48,%rsp >> 8ca91: e8 00 00 00 00 callq 8ca96 <radeon_pm_set_clocks+0x16> >> 8ca96: 49 89 fc mov %rdi,%r12 >> 8ca99: 48 8b 7f 08 mov 0x8(%rdi),%rdi >> 8ca9d: 48 83 c7 20 add $0x20,%rdi >> 8caa1: e8 00 00 00 00 callq 8caa6 <radeon_pm_set_clocks+0x26> >> 8caa6: 49 8d 84 24 a0 1c 00 lea 0x1ca0(%r12),%rax >> 8caad: 00 >> 8caae: 48 89 c7 mov %rax,%rdi >> 8cab1: 48 89 45 90 mov %rax,-0x70(%rbp) >> 8cab5: e8 00 00 00 00 callq 8caba <radeon_pm_set_clocks+0x3a> >> -> 8caba: 49 8d 8c 24 e8 09 00 lea 0x9e8(%r12),%rcx >> 8cac1: 00 >> 8cac2: 48 89 cf mov %rcx,%rdi >> 8cac5: 48 89 4d 98 mov %rcx,-0x68(%rbp) >> 8cac9: e8 00 00 00 00 callq 8cace <radeon_pm_set_clocks+0x4e> >> 8cace: 41 83 7c 24 78 1a cmpl $0x1a,0x78(%r12) >> 8cad4: 0f 86 46 01 00 00 jbe 8cc20 >> <radeon_pm_set_clocks+0x1a0> >> 8cada: 41 80 bc 24 d0 0d 00 cmpb $0x0,0xdd0(%r12) >> 8cae1: 00 00 >> >> This corresponds to: >> >> static void radeon_pm_set_clocks(struct radeon_device *rdev) >> { >> int i; >> >> mutex_lock(&rdev->ddev->struct_mutex); >> mutex_lock(&rdev->vram_mutex); >> -> mutex_lock(&rdev->cp.mutex); >> >> We know r12 is the rdev struct pointer (%rdi is the first function >> argument gets loaded into r12), and 0x9e8 is plausable as the offset >> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so >> the code is trying to acquire cp.mutex it seems. >> >> From the kernel log, the command proc ring test completed after the >> start of the GPU reset, which is the last callsite to unlock cp.mutex, >> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> >> before the hang at >> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>. >> >> Looking through all the calls between these two spots, I just can't >> see anything that _could_ have locked cp.mutex, unless I'm misreading >> the disassembly of mutex_lock. >> >> Any thoughts? > > Well, looking further back down the call chain, radeon_bo_create takes > the vram_mutex across the ttm_bo_init call, so I was duped by the > compiler's instruction scheduling. > > I looked at various changes to the locking scheme which did address > the X hang (thus blackouts), but all attempts failed with lockdep > violations somewhere, so there is no minimal fix without revisiting > recent locking scheme changes [1,2,3]. > > I'll pass on some other minor fixes I've cooked up in the meantime. > Note that I am running with the kernel boot options 'dynpm=1 > radeon.dynclks=1' to stop everything getting toasty; let me know if > you need more info for this issue. > > Thanks, > Daniel > > [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917 > [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923 > [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937 I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms timeout and subsequent hang every time on my RV710 when starting a game on doom3 (eg http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even without the dympm/dynclks options. It looks like the locking strategy of the Matthew's patches needs reworking. I was able to reproduce the hang by forcing a GPU reset also. What else would help with this? Thanks, Daniel -- Daniel J Blueman ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-17 22:12 ` Daniel J Blueman @ 2010-08-17 22:37 ` Alex Deucher 2010-08-23 14:26 ` Daniel J Blueman 0 siblings, 1 reply; 8+ messages in thread From: Alex Deucher @ 2010-08-17 22:37 UTC (permalink / raw) To: Daniel J Blueman; +Cc: dri-devel, Matthew Garrett On Tue, Aug 17, 2010 at 6:12 PM, Daniel J Blueman <daniel.blueman@gmail.com> wrote: > Hi chaps, > > On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman@gmail.com> wrote: >> On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote: >>> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote: >>>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote: >>>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >>>>>> >>>>>> After around 10-30 mins of typical usage (ie cairo rendering to >>>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >>>>>> locks up, giving a blank screen. >>>>>> >>>>>> The only information I have is a backtrace [1]. 'radeontool regs' >>>>>> doesn't give anything - what information, /sys files etc would be >>>>>> useful here, and would it be useful to log a freedesktop.org bug >>>>>> report at this stage, assuming I haven't found a reliable reproducer? >>>>>> >>>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >>>>>> >>>>>> Thanks, >>>>>> Daniel >>>>>> >>>>>> --- [1] >>>>>> >>>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >>>>>> ------------[ cut here ]------------ >>>>>> WARNING: at >>>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]() >>>>>> Hardware name: Studio 1557 >>>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >>>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >>>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >>>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >>>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >>>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >>>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >>>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >>>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >>>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >>>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >>>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >>>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >>>>>> libcrc32c >>>>>> Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 >>>>>> Call Trace: >>>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon] >>>>>> warn_slowpath_common+0x90/0xc0 >>>>>> warn_slowpath_fmt+0x6e/0x70 >>>>>> ? schedule_timeout+0x15a/0x2e0 >>>>>> ? r600_irq_set+0x27d/0xc00 [radeon] >>>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon] >>>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>>>>> radeon_fence_wait+0x2b7/0x320 [radeon] >>>>>> ? autoremove_wake_function+0x0/0x40 >>>>>> radeon_sync_obj_wait+0x11/0x20 [radeon] >>>>>> ttm_bo_wait+0x102/0x1b0 [ttm] >>>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>>>> radeon_move_blit+0x124/0x170 [radeon] >>>>>> radeon_bo_move+0xda/0x1a0 [radeon] >>>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >>>>>> ttm_bo_evict+0x1cc/0x270 [ttm] >>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>>> ttm_mem_evict_first+0xed/0x180 [ttm] >>>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >>>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm] >>>>>> ttm_bo_mem_space+0x305/0x370 [ttm] >>>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm] >>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>>> ttm_bo_validate+0xd5/0x100 [ttm] >>>>>> ttm_bo_init+0x1f2/0x240 [ttm] >>>>>> radeon_bo_create+0x121/0x240 [radeon] >>>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >>>>>> radeon_gem_object_create+0x89/0x100 [radeon] >>>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm] >>>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >>>>>> drm_ioctl+0x283/0x460 [drm] >>>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >>>>>> ? do_readv_writev+0x16b/0x1e0 >>>>>> vfs_ioctl+0x3a/0xc0 >>>>>> do_vfs_ioctl+0x6d/0x1f0 >>>>>> sys_ioctl+0x87/0xa0 >>>>>> ? sys_writev+0x55/0xb0 >>>>>> system_call_fastpath+0x16/0x1b >>>>>> ---[ end trace c0bc12025fa8386c ]--- >>>>>> radeon 0000:02:00.0: GPU softreset >>>>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 >>>>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE >>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >>>>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 >>>>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>>>> radeon 0000:02:00.0: GPU reset succeed >>>>>> Clocks initialized ! >>>>>> ring test succeeded in 1 usecs >>>>>> ib test succeeded in 1 usecs >>>>> >>>>> SO it keep reseting ? According to log there is a GPU lockup but then >>>>> a successfull GPU reset so GPU should resume fine (that's what log says) >>>>> Best is to open a bug and attach full dmesg + lspci -vv and context >>>>> in which the lockup happen >>>> >>>> Perhaps the lockup timeout is too short or interrupts are being >>>> delivered late or something? There's a fdo or kernel bugzilla entry >>>> where removing the lockup timeout fixed the issue. Does something >>>> like this: >>>> >>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c >>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c >>>> @@ -237,9 +237,11 @@ retry: >>>> * as signaled for now >>>> */ >>>> rdev->gpu_lockup = true; >>>> +#if 0 >>>> r = radeon_gpu_reset(rdev); >>>> if (r) >>>> return r; >>>> +#endif >>>> WREG32(rdev->fence_drv.scratch_reg, fence->seq); >>>> rdev->gpu_lockup = false; >>>> } >>>> >>>> fix the issue? >>> >>> Correction, RV710. >>> >>> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already >>> excruciatingly long for the command processor to not have responded, >>> but the issue I'm really hitting here is that X if left hanging after >>> GPU reset. The GPU lockup was being reported at: >>> >>> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] >>> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 >>> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 >>> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 >>> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] >>> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] >>> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] >>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >>> >>> X consistently hangs after GPU reset at (with sysrq-T): >>> >>> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 >>> [<ffffffff8157647b>] mutex_lock+0x2b/0x50 >>> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] >>> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 >>> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] >>> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 >>> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] >>> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] >>> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] >>> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] >>> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] >>> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] >>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >>> >>> I couldn't reproduce it (but did hit it a number of times over the >>> last two weeks), so have been analysing with what we have. The kernel >>> the backtrace is against isn't built with debug, so: >>> >>> radeon_pm_set_clocks(): >>> 8ca80: 55 push %rbp >>> 8ca81: 48 89 e5 mov %rsp,%rbp >>> 8ca84: 41 57 push %r15 >>> 8ca86: 41 56 push %r14 >>> 8ca88: 41 55 push %r13 >>> 8ca8a: 41 54 push %r12 >>> 8ca8c: 53 push %rbx >>> 8ca8d: 48 83 ec 48 sub $0x48,%rsp >>> 8ca91: e8 00 00 00 00 callq 8ca96 <radeon_pm_set_clocks+0x16> >>> 8ca96: 49 89 fc mov %rdi,%r12 >>> 8ca99: 48 8b 7f 08 mov 0x8(%rdi),%rdi >>> 8ca9d: 48 83 c7 20 add $0x20,%rdi >>> 8caa1: e8 00 00 00 00 callq 8caa6 <radeon_pm_set_clocks+0x26> >>> 8caa6: 49 8d 84 24 a0 1c 00 lea 0x1ca0(%r12),%rax >>> 8caad: 00 >>> 8caae: 48 89 c7 mov %rax,%rdi >>> 8cab1: 48 89 45 90 mov %rax,-0x70(%rbp) >>> 8cab5: e8 00 00 00 00 callq 8caba <radeon_pm_set_clocks+0x3a> >>> -> 8caba: 49 8d 8c 24 e8 09 00 lea 0x9e8(%r12),%rcx >>> 8cac1: 00 >>> 8cac2: 48 89 cf mov %rcx,%rdi >>> 8cac5: 48 89 4d 98 mov %rcx,-0x68(%rbp) >>> 8cac9: e8 00 00 00 00 callq 8cace <radeon_pm_set_clocks+0x4e> >>> 8cace: 41 83 7c 24 78 1a cmpl $0x1a,0x78(%r12) >>> 8cad4: 0f 86 46 01 00 00 jbe 8cc20 >>> <radeon_pm_set_clocks+0x1a0> >>> 8cada: 41 80 bc 24 d0 0d 00 cmpb $0x0,0xdd0(%r12) >>> 8cae1: 00 00 >>> >>> This corresponds to: >>> >>> static void radeon_pm_set_clocks(struct radeon_device *rdev) >>> { >>> int i; >>> >>> mutex_lock(&rdev->ddev->struct_mutex); >>> mutex_lock(&rdev->vram_mutex); >>> -> mutex_lock(&rdev->cp.mutex); >>> >>> We know r12 is the rdev struct pointer (%rdi is the first function >>> argument gets loaded into r12), and 0x9e8 is plausable as the offset >>> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so >>> the code is trying to acquire cp.mutex it seems. >>> >>> From the kernel log, the command proc ring test completed after the >>> start of the GPU reset, which is the last callsite to unlock cp.mutex, >>> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> >>> before the hang at >>> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>. >>> >>> Looking through all the calls between these two spots, I just can't >>> see anything that _could_ have locked cp.mutex, unless I'm misreading >>> the disassembly of mutex_lock. >>> >>> Any thoughts? >> >> Well, looking further back down the call chain, radeon_bo_create takes >> the vram_mutex across the ttm_bo_init call, so I was duped by the >> compiler's instruction scheduling. >> >> I looked at various changes to the locking scheme which did address >> the X hang (thus blackouts), but all attempts failed with lockdep >> violations somewhere, so there is no minimal fix without revisiting >> recent locking scheme changes [1,2,3]. >> >> I'll pass on some other minor fixes I've cooked up in the meantime. >> Note that I am running with the kernel boot options 'dynpm=1 >> radeon.dynclks=1' to stop everything getting toasty; let me know if >> you need more info for this issue. >> >> Thanks, >> Daniel >> >> [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917 >> [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923 >> [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937 > > I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms > timeout and subsequent hang every time on my RV710 when starting a > game on doom3 (eg > http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even > without the dympm/dynclks options. > > It looks like the locking strategy of the Matthew's patches needs > reworking. I was able to reproduce the hang by forcing a GPU reset > also. > > What else would help with this? Does this patch help? http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=d02a83bb2dd3e34408844961336a66310a35ba30 It fixes several other similar bugs. Alex ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: GPU lockup+blackout on 2.6.35 with r600... 2010-08-17 22:37 ` Alex Deucher @ 2010-08-23 14:26 ` Daniel J Blueman 0 siblings, 0 replies; 8+ messages in thread From: Daniel J Blueman @ 2010-08-23 14:26 UTC (permalink / raw) To: Alex Deucher; +Cc: dri-devel, Matthew Garrett On 17 August 2010 23:37, Alex Deucher <alexdeucher@gmail.com> wrote: > On Tue, Aug 17, 2010 at 6:12 PM, Daniel J Blueman > <daniel.blueman@gmail.com> wrote: >> Hi chaps, >> >> On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman@gmail.com> wrote: >>> On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote: >>>> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote: >>>>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote: >>>>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote: >>>>>>> >>>>>>> After around 10-30 mins of typical usage (ie cairo rendering to >>>>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon >>>>>>> locks up, giving a blank screen. >>>>>>> >>>>>>> The only information I have is a backtrace [1]. 'radeontool regs' >>>>>>> doesn't give anything - what information, /sys files etc would be >>>>>>> useful here, and would it be useful to log a freedesktop.org bug >>>>>>> report at this stage, assuming I haven't found a reliable reproducer? >>>>>>> >>>>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21. >>>>>>> >>>>>>> Thanks, >>>>>>> Daniel >>>>>>> >>>>>>> --- [1] >>>>>>> >>>>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec >>>>>>> ------------[ cut here ]------------ >>>>>>> WARNING: at >>>>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 >>>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]() >>>>>>> Hardware name: Studio 1557 >>>>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) >>>>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm >>>>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp >>>>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables >>>>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp >>>>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm >>>>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack >>>>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video >>>>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp >>>>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 >>>>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core >>>>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci >>>>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c >>>>>>> libcrc32c >>>>>>> Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 >>>>>>> Call Trace: >>>>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon] >>>>>>> warn_slowpath_common+0x90/0xc0 >>>>>>> warn_slowpath_fmt+0x6e/0x70 >>>>>>> ? schedule_timeout+0x15a/0x2e0 >>>>>>> ? r600_irq_set+0x27d/0xc00 [radeon] >>>>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon] >>>>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>>>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>>>>>> radeon_fence_wait+0x2b7/0x320 [radeon] >>>>>>> ? autoremove_wake_function+0x0/0x40 >>>>>>> radeon_sync_obj_wait+0x11/0x20 [radeon] >>>>>>> ttm_bo_wait+0x102/0x1b0 [ttm] >>>>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>>>>> radeon_move_blit+0x124/0x170 [radeon] >>>>>>> radeon_bo_move+0xda/0x1a0 [radeon] >>>>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm] >>>>>>> ttm_bo_evict+0x1cc/0x270 [ttm] >>>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>>>> ttm_mem_evict_first+0xed/0x180 [ttm] >>>>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] >>>>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm] >>>>>>> ttm_bo_mem_space+0x305/0x370 [ttm] >>>>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm] >>>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm] >>>>>>> ttm_bo_validate+0xd5/0x100 [ttm] >>>>>>> ttm_bo_init+0x1f2/0x240 [ttm] >>>>>>> radeon_bo_create+0x121/0x240 [radeon] >>>>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] >>>>>>> radeon_gem_object_create+0x89/0x100 [radeon] >>>>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm] >>>>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon] >>>>>>> drm_ioctl+0x283/0x460 [drm] >>>>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] >>>>>>> ? do_readv_writev+0x16b/0x1e0 >>>>>>> vfs_ioctl+0x3a/0xc0 >>>>>>> do_vfs_ioctl+0x6d/0x1f0 >>>>>>> sys_ioctl+0x87/0xa0 >>>>>>> ? sys_writev+0x55/0xb0 >>>>>>> system_call_fastpath+0x16/0x1b >>>>>>> ---[ end trace c0bc12025fa8386c ]--- >>>>>>> radeon 0000:02:00.0: GPU softreset >>>>>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 >>>>>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>>>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE >>>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 >>>>>>> radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 >>>>>>> radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 >>>>>>> radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 >>>>>>> radeon 0000:02:00.0: GPU reset succeed >>>>>>> Clocks initialized ! >>>>>>> ring test succeeded in 1 usecs >>>>>>> ib test succeeded in 1 usecs >>>>>> >>>>>> SO it keep reseting ? According to log there is a GPU lockup but then >>>>>> a successfull GPU reset so GPU should resume fine (that's what log says) >>>>>> Best is to open a bug and attach full dmesg + lspci -vv and context >>>>>> in which the lockup happen >>>>> >>>>> Perhaps the lockup timeout is too short or interrupts are being >>>>> delivered late or something? There's a fdo or kernel bugzilla entry >>>>> where removing the lockup timeout fixed the issue. Does something >>>>> like this: >>>>> >>>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c >>>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c >>>>> @@ -237,9 +237,11 @@ retry: >>>>> * as signaled for now >>>>> */ >>>>> rdev->gpu_lockup = true; >>>>> +#if 0 >>>>> r = radeon_gpu_reset(rdev); >>>>> if (r) >>>>> return r; >>>>> +#endif >>>>> WREG32(rdev->fence_drv.scratch_reg, fence->seq); >>>>> rdev->gpu_lockup = false; >>>>> } >>>>> >>>>> fix the issue? >>>> >>>> Correction, RV710. >>>> >>>> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already >>>> excruciatingly long for the command processor to not have responded, >>>> but the issue I'm really hitting here is that X if left hanging after >>>> GPU reset. The GPU lockup was being reported at: >>>> >>>> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] >>>> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 >>>> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 >>>> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 >>>> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] >>>> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] >>>> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] >>>> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] >>>> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] >>>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >>>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >>>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >>>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >>>> >>>> X consistently hangs after GPU reset at (with sysrq-T): >>>> >>>> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 >>>> [<ffffffff8157647b>] mutex_lock+0x2b/0x50 >>>> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] >>>> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 >>>> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] >>>> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 >>>> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] >>>> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] >>>> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] >>>> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] >>>> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] >>>> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] >>>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 >>>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] >>>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] >>>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] >>>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon] >>>> >>>> I couldn't reproduce it (but did hit it a number of times over the >>>> last two weeks), so have been analysing with what we have. The kernel >>>> the backtrace is against isn't built with debug, so: >>>> >>>> radeon_pm_set_clocks(): >>>> 8ca80: 55 push %rbp >>>> 8ca81: 48 89 e5 mov %rsp,%rbp >>>> 8ca84: 41 57 push %r15 >>>> 8ca86: 41 56 push %r14 >>>> 8ca88: 41 55 push %r13 >>>> 8ca8a: 41 54 push %r12 >>>> 8ca8c: 53 push %rbx >>>> 8ca8d: 48 83 ec 48 sub $0x48,%rsp >>>> 8ca91: e8 00 00 00 00 callq 8ca96 <radeon_pm_set_clocks+0x16> >>>> 8ca96: 49 89 fc mov %rdi,%r12 >>>> 8ca99: 48 8b 7f 08 mov 0x8(%rdi),%rdi >>>> 8ca9d: 48 83 c7 20 add $0x20,%rdi >>>> 8caa1: e8 00 00 00 00 callq 8caa6 <radeon_pm_set_clocks+0x26> >>>> 8caa6: 49 8d 84 24 a0 1c 00 lea 0x1ca0(%r12),%rax >>>> 8caad: 00 >>>> 8caae: 48 89 c7 mov %rax,%rdi >>>> 8cab1: 48 89 45 90 mov %rax,-0x70(%rbp) >>>> 8cab5: e8 00 00 00 00 callq 8caba <radeon_pm_set_clocks+0x3a> >>>> -> 8caba: 49 8d 8c 24 e8 09 00 lea 0x9e8(%r12),%rcx >>>> 8cac1: 00 >>>> 8cac2: 48 89 cf mov %rcx,%rdi >>>> 8cac5: 48 89 4d 98 mov %rcx,-0x68(%rbp) >>>> 8cac9: e8 00 00 00 00 callq 8cace <radeon_pm_set_clocks+0x4e> >>>> 8cace: 41 83 7c 24 78 1a cmpl $0x1a,0x78(%r12) >>>> 8cad4: 0f 86 46 01 00 00 jbe 8cc20 >>>> <radeon_pm_set_clocks+0x1a0> >>>> 8cada: 41 80 bc 24 d0 0d 00 cmpb $0x0,0xdd0(%r12) >>>> 8cae1: 00 00 >>>> >>>> This corresponds to: >>>> >>>> static void radeon_pm_set_clocks(struct radeon_device *rdev) >>>> { >>>> int i; >>>> >>>> mutex_lock(&rdev->ddev->struct_mutex); >>>> mutex_lock(&rdev->vram_mutex); >>>> -> mutex_lock(&rdev->cp.mutex); >>>> >>>> We know r12 is the rdev struct pointer (%rdi is the first function >>>> argument gets loaded into r12), and 0x9e8 is plausable as the offset >>>> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so >>>> the code is trying to acquire cp.mutex it seems. >>>> >>>> From the kernel log, the command proc ring test completed after the >>>> start of the GPU reset, which is the last callsite to unlock cp.mutex, >>>> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> >>>> before the hang at >>>> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>. >>>> >>>> Looking through all the calls between these two spots, I just can't >>>> see anything that _could_ have locked cp.mutex, unless I'm misreading >>>> the disassembly of mutex_lock. >>>> >>>> Any thoughts? >>> >>> Well, looking further back down the call chain, radeon_bo_create takes >>> the vram_mutex across the ttm_bo_init call, so I was duped by the >>> compiler's instruction scheduling. >>> >>> I looked at various changes to the locking scheme which did address >>> the X hang (thus blackouts), but all attempts failed with lockdep >>> violations somewhere, so there is no minimal fix without revisiting >>> recent locking scheme changes [1,2,3]. >>> >>> I'll pass on some other minor fixes I've cooked up in the meantime. >>> Note that I am running with the kernel boot options 'dynpm=1 >>> radeon.dynclks=1' to stop everything getting toasty; let me know if >>> you need more info for this issue. >>> >>> Thanks, >>> Daniel >>> >>> [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917 >>> [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923 >>> [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937 >> >> I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms >> timeout and subsequent hang every time on my RV710 when starting a >> game on doom3 (eg >> http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even >> without the dympm/dynclks options. >> >> It looks like the locking strategy of the Matthew's patches needs >> reworking. I was able to reproduce the hang by forcing a GPU reset >> also. >> >> What else would help with this? > > Does this patch help? > > http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=d02a83bb2dd3e34408844961336a66310a35ba30 > > It fixes several other similar bugs. Yes, sorry for the delays replying. I still didn't get time to analyse other call-paths this can affect, and thus the possibility of a pending power mode change causing the same deadlock. That aside, this patch does prevent the deadlocks I was able to reproduce, including when the CS parser detects issues in user-submitted CS packets. Thanks, Daniel -- Daniel J Blueman ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2010-08-23 14:26 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-08-05 10:52 GPU lockup+blackout on 2.6.35 with r600 Daniel J Blueman 2010-08-05 13:50 ` Jerome Glisse 2010-08-05 15:23 ` Alex Deucher 2010-08-08 16:36 ` Daniel J Blueman 2010-08-09 13:41 ` Daniel J Blueman 2010-08-17 22:12 ` Daniel J Blueman 2010-08-17 22:37 ` Alex Deucher 2010-08-23 14:26 ` Daniel J Blueman
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.