GPU lockup+blackout on 2.6.35 with r600...

All of lore.kernel.org
 help / color / mirror / Atom feed

* GPU lockup+blackout on 2.6.35 with r600...
@ 2010-08-05 10:52 Daniel J Blueman
  2010-08-05 13:50 ` Jerome Glisse
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel J Blueman @ 2010-08-05 10:52 UTC (permalink / raw)
  To: dri-devel

After around 10-30 mins of typical usage (ie cairo rendering to
surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
locks up, giving a blank screen.

The only information I have is a backtrace [1]. 'radeontool regs'
doesn't give anything - what information, /sys files etc would be
useful here, and would it be useful to log a freedesktop.org bug
report at this stage, assuming I haven't found a reliable reproducer?

OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.

Thanks,
  Daniel

--- [1]

radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
------------[ cut here ]------------
WARNING: at /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
radeon_fence_wait+0x2b7/0x320 [radeon]()
Hardware name: Studio 1557
GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
libcrc32c
Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
Call Trace:
? radeon_fence_wait+0x2b7/0x320 [radeon]
warn_slowpath_common+0x90/0xc0
warn_slowpath_fmt+0x6e/0x70
? schedule_timeout+0x15a/0x2e0
? r600_irq_set+0x27d/0xc00 [radeon]
? radeon_ring_commit+0xa3/0xb0 [radeon]
? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
radeon_fence_wait+0x2b7/0x320 [radeon]
? autoremove_wake_function+0x0/0x40
radeon_sync_obj_wait+0x11/0x20 [radeon]
ttm_bo_wait+0x102/0x1b0 [ttm]
ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
radeon_move_blit+0x124/0x170 [radeon]
radeon_bo_move+0xda/0x1a0 [radeon]
ttm_bo_handle_move_mem+0xec/0x370 [ttm]
ttm_bo_evict+0x1cc/0x270 [ttm]
? drm_mm_split_at_start+0x1d/0x80 [drm]
ttm_mem_evict_first+0xed/0x180 [ttm]
? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
ttm_bo_mem_force_space+0xab/0x110 [ttm]
ttm_bo_mem_space+0x305/0x370 [ttm]
ttm_bo_move_buffer+0xcf/0x140 [ttm]
? drm_mm_split_at_start+0x1d/0x80 [drm]
ttm_bo_validate+0xd5/0x100 [ttm]
ttm_bo_init+0x1f2/0x240 [ttm]
radeon_bo_create+0x121/0x240 [radeon]
? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
radeon_gem_object_create+0x89/0x100 [radeon]
? drm_gem_handle_create+0x8b/0xa0 [drm]
radeon_gem_create_ioctl+0x58/0xe0 [radeon]
drm_ioctl+0x283/0x460 [drm]
? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
? do_readv_writev+0x16b/0x1e0
vfs_ioctl+0x3a/0xc0
do_vfs_ioctl+0x6d/0x1f0
sys_ioctl+0x87/0xa0
? sys_writev+0x55/0xb0
system_call_fastpath+0x16/0x1b
---[ end trace c0bc12025fa8386c ]---
radeon 0000:02:00.0: GPU softreset
radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
radeon 0000:02:00.0: GPU reset succeed
Clocks initialized !
ring test succeeded in 1 usecs
ib test succeeded in 1 usecs
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-05 10:52 GPU lockup+blackout on 2.6.35 with r600 Daniel J Blueman
@ 2010-08-05 13:50 ` Jerome Glisse
  2010-08-05 15:23   ` Alex Deucher
  0 siblings, 1 reply; 8+ messages in thread
From: Jerome Glisse @ 2010-08-05 13:50 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: dri-devel

On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
> After around 10-30 mins of typical usage (ie cairo rendering to
> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
> locks up, giving a blank screen.
>
> The only information I have is a backtrace [1]. 'radeontool regs'
> doesn't give anything - what information, /sys files etc would be
> useful here, and would it be useful to log a freedesktop.org bug
> report at this stage, assuming I haven't found a reliable reproducer?
>
> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>
> Thanks,
>    Daniel
>
> --- [1]
>
> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
> ------------[ cut here ]------------
> WARNING: at /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
> radeon_fence_wait+0x2b7/0x320 [radeon]()
> Hardware name: Studio 1557
> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
> libcrc32c
> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
> Call Trace:
> ? radeon_fence_wait+0x2b7/0x320 [radeon]
> warn_slowpath_common+0x90/0xc0
> warn_slowpath_fmt+0x6e/0x70
> ? schedule_timeout+0x15a/0x2e0
> ? r600_irq_set+0x27d/0xc00 [radeon]
> ? radeon_ring_commit+0xa3/0xb0 [radeon]
> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
> radeon_fence_wait+0x2b7/0x320 [radeon]
> ? autoremove_wake_function+0x0/0x40
> radeon_sync_obj_wait+0x11/0x20 [radeon]
> ttm_bo_wait+0x102/0x1b0 [ttm]
> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
> radeon_move_blit+0x124/0x170 [radeon]
> radeon_bo_move+0xda/0x1a0 [radeon]
> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
> ttm_bo_evict+0x1cc/0x270 [ttm]
> ? drm_mm_split_at_start+0x1d/0x80 [drm]
> ttm_mem_evict_first+0xed/0x180 [ttm]
> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
> ttm_bo_mem_force_space+0xab/0x110 [ttm]
> ttm_bo_mem_space+0x305/0x370 [ttm]
> ttm_bo_move_buffer+0xcf/0x140 [ttm]
> ? drm_mm_split_at_start+0x1d/0x80 [drm]
> ttm_bo_validate+0xd5/0x100 [ttm]
> ttm_bo_init+0x1f2/0x240 [ttm]
> radeon_bo_create+0x121/0x240 [radeon]
> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
> radeon_gem_object_create+0x89/0x100 [radeon]
> ? drm_gem_handle_create+0x8b/0xa0 [drm]
> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
> drm_ioctl+0x283/0x460 [drm]
> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
> ? do_readv_writev+0x16b/0x1e0
> vfs_ioctl+0x3a/0xc0
> do_vfs_ioctl+0x6d/0x1f0
> sys_ioctl+0x87/0xa0
> ? sys_writev+0x55/0xb0
> system_call_fastpath+0x16/0x1b
> ---[ end trace c0bc12025fa8386c ]---
> radeon 0000:02:00.0: GPU softreset
> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
> radeon 0000:02:00.0: GPU reset succeed
> Clocks initialized !
> ring test succeeded in 1 usecs
> ib test succeeded in 1 usecs


SO it keep reseting ? According to log there is a GPU lockup but then
a successfull GPU reset so GPU should resume fine (that's what log says)
Best is to open a bug and attach full dmesg + lspci -vv and context
in which the lockup happen

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-05 13:50 ` Jerome Glisse
@ 2010-08-05 15:23   ` Alex Deucher
  2010-08-08 16:36     ` Daniel J Blueman
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Deucher @ 2010-08-05 15:23 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Daniel J Blueman, dri-devel

On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote:
> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>
>> After around 10-30 mins of typical usage (ie cairo rendering to
>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>> locks up, giving a blank screen.
>>
>> The only information I have is a backtrace [1]. 'radeontool regs'
>> doesn't give anything - what information, /sys files etc would be
>> useful here, and would it be useful to log a freedesktop.org bug
>> report at this stage, assuming I haven't found a reliable reproducer?
>>
>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>
>> Thanks,
>>   Daniel
>>
>> --- [1]
>>
>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>> ------------[ cut here ]------------
>> WARNING: at
>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>> Hardware name: Studio 1557
>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>> libcrc32c
>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>> Call Trace:
>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>> warn_slowpath_common+0x90/0xc0
>> warn_slowpath_fmt+0x6e/0x70
>> ? schedule_timeout+0x15a/0x2e0
>> ? r600_irq_set+0x27d/0xc00 [radeon]
>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>> radeon_fence_wait+0x2b7/0x320 [radeon]
>> ? autoremove_wake_function+0x0/0x40
>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>> ttm_bo_wait+0x102/0x1b0 [ttm]
>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>> radeon_move_blit+0x124/0x170 [radeon]
>> radeon_bo_move+0xda/0x1a0 [radeon]
>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>> ttm_bo_evict+0x1cc/0x270 [ttm]
>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>> ttm_mem_evict_first+0xed/0x180 [ttm]
>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>> ttm_bo_mem_space+0x305/0x370 [ttm]
>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>> ttm_bo_validate+0xd5/0x100 [ttm]
>> ttm_bo_init+0x1f2/0x240 [ttm]
>> radeon_bo_create+0x121/0x240 [radeon]
>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>> radeon_gem_object_create+0x89/0x100 [radeon]
>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>> drm_ioctl+0x283/0x460 [drm]
>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>> ? do_readv_writev+0x16b/0x1e0
>> vfs_ioctl+0x3a/0xc0
>> do_vfs_ioctl+0x6d/0x1f0
>> sys_ioctl+0x87/0xa0
>> ? sys_writev+0x55/0xb0
>> system_call_fastpath+0x16/0x1b
>> ---[ end trace c0bc12025fa8386c ]---
>> radeon 0000:02:00.0: GPU softreset
>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>> radeon 0000:02:00.0: GPU reset succeed
>> Clocks initialized !
>> ring test succeeded in 1 usecs
>> ib test succeeded in 1 usecs
>
>
> SO it keep reseting ? According to log there is a GPU lockup but then
> a successfull GPU reset so GPU should resume fine (that's what log says)
> Best is to open a bug and attach full dmesg + lspci -vv and context
> in which the lockup happen
>

Perhaps the lockup timeout is too short or interrupts are being
delivered late or something?  There's a fdo or kernel bugzilla entry
where removing the lockup timeout fixed the issue.  Does something
like this:

--- a/drivers/gpu/drm/radeon/radeon_fence.c
+++ b/drivers/gpu/drm/radeon/radeon_fence.c
@@ -237,9 +237,11 @@ retry:
                         * as signaled for now
                         */
                        rdev->gpu_lockup = true;
+#if 0
                        r = radeon_gpu_reset(rdev);
                        if (r)
                                return r;
+#endif
                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
                        rdev->gpu_lockup = false;
                }

fix the issue?

Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-05 15:23   ` Alex Deucher
@ 2010-08-08 16:36     ` Daniel J Blueman
  2010-08-09 13:41       ` Daniel J Blueman
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel J Blueman @ 2010-08-08 16:36 UTC (permalink / raw)
  To: Alex Deucher, Jerome Glisse; +Cc: dri-devel

On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote:
>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>>
>>> After around 10-30 mins of typical usage (ie cairo rendering to
>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>>> locks up, giving a blank screen.
>>>
>>> The only information I have is a backtrace [1]. 'radeontool regs'
>>> doesn't give anything - what information, /sys files etc would be
>>> useful here, and would it be useful to log a freedesktop.org bug
>>> report at this stage, assuming I haven't found a reliable reproducer?
>>>
>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>>
>>> Thanks,
>>>   Daniel
>>>
>>> --- [1]
>>>
>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>>> ------------[ cut here ]------------
>>> WARNING: at
>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>>> Hardware name: Studio 1557
>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>>> libcrc32c
>>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>>> Call Trace:
>>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>> warn_slowpath_common+0x90/0xc0
>>> warn_slowpath_fmt+0x6e/0x70
>>> ? schedule_timeout+0x15a/0x2e0
>>> ? r600_irq_set+0x27d/0xc00 [radeon]
>>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>> radeon_fence_wait+0x2b7/0x320 [radeon]
>>> ? autoremove_wake_function+0x0/0x40
>>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>>> ttm_bo_wait+0x102/0x1b0 [ttm]
>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>> radeon_move_blit+0x124/0x170 [radeon]
>>> radeon_bo_move+0xda/0x1a0 [radeon]
>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>>> ttm_bo_evict+0x1cc/0x270 [ttm]
>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>> ttm_mem_evict_first+0xed/0x180 [ttm]
>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>>> ttm_bo_mem_space+0x305/0x370 [ttm]
>>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>> ttm_bo_validate+0xd5/0x100 [ttm]
>>> ttm_bo_init+0x1f2/0x240 [ttm]
>>> radeon_bo_create+0x121/0x240 [radeon]
>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>>> radeon_gem_object_create+0x89/0x100 [radeon]
>>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>>> drm_ioctl+0x283/0x460 [drm]
>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>>> ? do_readv_writev+0x16b/0x1e0
>>> vfs_ioctl+0x3a/0xc0
>>> do_vfs_ioctl+0x6d/0x1f0
>>> sys_ioctl+0x87/0xa0
>>> ? sys_writev+0x55/0xb0
>>> system_call_fastpath+0x16/0x1b
>>> ---[ end trace c0bc12025fa8386c ]---
>>> radeon 0000:02:00.0: GPU softreset
>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>> radeon 0000:02:00.0: GPU reset succeed
>>> Clocks initialized !
>>> ring test succeeded in 1 usecs
>>> ib test succeeded in 1 usecs
>>
>> SO it keep reseting ? According to log there is a GPU lockup but then
>> a successfull GPU reset so GPU should resume fine (that's what log says)
>> Best is to open a bug and attach full dmesg + lspci -vv and context
>> in which the lockup happen
>
> Perhaps the lockup timeout is too short or interrupts are being
> delivered late or something?  There's a fdo or kernel bugzilla entry
> where removing the lockup timeout fixed the issue.  Does something
> like this:
>
> --- a/drivers/gpu/drm/radeon/radeon_fence.c
> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
> @@ -237,9 +237,11 @@ retry:
>                         * as signaled for now
>                         */
>                        rdev->gpu_lockup = true;
> +#if 0
>                        r = radeon_gpu_reset(rdev);
>                        if (r)
>                                return r;
> +#endif
>                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
>                        rdev->gpu_lockup = false;
>                }
>
> fix the issue?

Correction, RV710.

Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already
excruciatingly long for the command processor to not have responded,
but the issue I'm really hitting here is that X if left hanging after
GPU reset. The GPU lockup was being reported at:

[<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon]
[<ffffffff81061370>] warn_slowpath_common+0x90/0xc0
[<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70
[<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0
[<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon]
[<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon]
[<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
[<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
[<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon]
[<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
[<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
[<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
[<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]

X consistently hangs after GPU reset at (with sysrq-T):

[<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
[<ffffffff8157647b>] mutex_lock+0x2b/0x50
[<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon]
[<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40
[<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm]
[<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0
[<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon]
[<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon]
[<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper]
[<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper]
[<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon]
[<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon]
[<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
[<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
[<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
[<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]

I couldn't reproduce it (but did hit it a number of times over the
last two weeks), so have been analysing with what we have. The kernel
the backtrace is against isn't built with debug, so:

radeon_pm_set_clocks():
   8ca80:       55                      push   %rbp
   8ca81:       48 89 e5                mov    %rsp,%rbp
   8ca84:       41 57                   push   %r15
   8ca86:       41 56                   push   %r14
   8ca88:       41 55                   push   %r13
   8ca8a:       41 54                   push   %r12
   8ca8c:       53                      push   %rbx
   8ca8d:       48 83 ec 48             sub    $0x48,%rsp
   8ca91:       e8 00 00 00 00          callq  8ca96 <radeon_pm_set_clocks+0x16>
   8ca96:       49 89 fc                mov    %rdi,%r12
   8ca99:       48 8b 7f 08             mov    0x8(%rdi),%rdi
   8ca9d:       48 83 c7 20             add    $0x20,%rdi
   8caa1:       e8 00 00 00 00          callq  8caa6 <radeon_pm_set_clocks+0x26>
   8caa6:       49 8d 84 24 a0 1c 00    lea    0x1ca0(%r12),%rax
   8caad:       00
   8caae:       48 89 c7                mov    %rax,%rdi
   8cab1:       48 89 45 90             mov    %rax,-0x70(%rbp)
   8cab5:       e8 00 00 00 00          callq  8caba <radeon_pm_set_clocks+0x3a>
-> 8caba:       49 8d 8c 24 e8 09 00    lea    0x9e8(%r12),%rcx
   8cac1:       00
   8cac2:       48 89 cf                mov    %rcx,%rdi
   8cac5:       48 89 4d 98             mov    %rcx,-0x68(%rbp)
   8cac9:       e8 00 00 00 00          callq  8cace <radeon_pm_set_clocks+0x4e>
   8cace:       41 83 7c 24 78 1a       cmpl   $0x1a,0x78(%r12)
   8cad4:       0f 86 46 01 00 00       jbe    8cc20
<radeon_pm_set_clocks+0x1a0>
   8cada:       41 80 bc 24 d0 0d 00    cmpb   $0x0,0xdd0(%r12)
   8cae1:       00 00

This corresponds to:

static void radeon_pm_set_clocks(struct radeon_device *rdev)
{
	int i;

	mutex_lock(&rdev->ddev->struct_mutex);
	mutex_lock(&rdev->vram_mutex);
->	mutex_lock(&rdev->cp.mutex);

We know r12 is the rdev struct pointer (%rdi is the first function
argument gets loaded into r12), and 0x9e8 is plausable as the offset
of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so
the code is trying to acquire cp.mutex it seems.

>From the kernel log, the command proc ring test completed after the
start of the GPU reset, which is the last callsite to unlock cp.mutex,
ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit>
before the hang at
<radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.

Looking through all the calls between these two spots, I just can't
see anything that _could_ have locked cp.mutex, unless I'm misreading
the disassembly of mutex_lock.

Any thoughts?

Dan
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-08 16:36     ` Daniel J Blueman
@ 2010-08-09 13:41       ` Daniel J Blueman
  2010-08-17 22:12         ` Daniel J Blueman
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel J Blueman @ 2010-08-09 13:41 UTC (permalink / raw)
  To: Alex Deucher, Jerome Glisse, Matthew Garrett; +Cc: dri-devel

On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote:
>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>>>
>>>> After around 10-30 mins of typical usage (ie cairo rendering to
>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>>>> locks up, giving a blank screen.
>>>>
>>>> The only information I have is a backtrace [1]. 'radeontool regs'
>>>> doesn't give anything - what information, /sys files etc would be
>>>> useful here, and would it be useful to log a freedesktop.org bug
>>>> report at this stage, assuming I haven't found a reliable reproducer?
>>>>
>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>>>
>>>> Thanks,
>>>>   Daniel
>>>>
>>>> --- [1]
>>>>
>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>>>> ------------[ cut here ]------------
>>>> WARNING: at
>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>>>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>>>> Hardware name: Studio 1557
>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>>>> libcrc32c
>>>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>>>> Call Trace:
>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>>> warn_slowpath_common+0x90/0xc0
>>>> warn_slowpath_fmt+0x6e/0x70
>>>> ? schedule_timeout+0x15a/0x2e0
>>>> ? r600_irq_set+0x27d/0xc00 [radeon]
>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>>> radeon_fence_wait+0x2b7/0x320 [radeon]
>>>> ? autoremove_wake_function+0x0/0x40
>>>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>> ttm_bo_wait+0x102/0x1b0 [ttm]
>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>> radeon_move_blit+0x124/0x170 [radeon]
>>>> radeon_bo_move+0xda/0x1a0 [radeon]
>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>>>> ttm_bo_evict+0x1cc/0x270 [ttm]
>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>> ttm_mem_evict_first+0xed/0x180 [ttm]
>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>>>> ttm_bo_mem_space+0x305/0x370 [ttm]
>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>> ttm_bo_validate+0xd5/0x100 [ttm]
>>>> ttm_bo_init+0x1f2/0x240 [ttm]
>>>> radeon_bo_create+0x121/0x240 [radeon]
>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>>>> radeon_gem_object_create+0x89/0x100 [radeon]
>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>>>> drm_ioctl+0x283/0x460 [drm]
>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>>>> ? do_readv_writev+0x16b/0x1e0
>>>> vfs_ioctl+0x3a/0xc0
>>>> do_vfs_ioctl+0x6d/0x1f0
>>>> sys_ioctl+0x87/0xa0
>>>> ? sys_writev+0x55/0xb0
>>>> system_call_fastpath+0x16/0x1b
>>>> ---[ end trace c0bc12025fa8386c ]---
>>>> radeon 0000:02:00.0: GPU softreset
>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>> radeon 0000:02:00.0: GPU reset succeed
>>>> Clocks initialized !
>>>> ring test succeeded in 1 usecs
>>>> ib test succeeded in 1 usecs
>>>
>>> SO it keep reseting ? According to log there is a GPU lockup but then
>>> a successfull GPU reset so GPU should resume fine (that's what log says)
>>> Best is to open a bug and attach full dmesg + lspci -vv and context
>>> in which the lockup happen
>>
>> Perhaps the lockup timeout is too short or interrupts are being
>> delivered late or something?  There's a fdo or kernel bugzilla entry
>> where removing the lockup timeout fixed the issue.  Does something
>> like this:
>>
>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>> @@ -237,9 +237,11 @@ retry:
>>                         * as signaled for now
>>                         */
>>                        rdev->gpu_lockup = true;
>> +#if 0
>>                        r = radeon_gpu_reset(rdev);
>>                        if (r)
>>                                return r;
>> +#endif
>>                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
>>                        rdev->gpu_lockup = false;
>>                }
>>
>> fix the issue?
>
> Correction, RV710.
>
> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already
> excruciatingly long for the command processor to not have responded,
> but the issue I'm really hitting here is that X if left hanging after
> GPU reset. The GPU lockup was being reported at:
>
> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon]
> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0
> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70
> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0
> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon]
> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon]
> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon]
> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>
> X consistently hangs after GPU reset at (with sysrq-T):
>
> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
> [<ffffffff8157647b>] mutex_lock+0x2b/0x50
> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon]
> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40
> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm]
> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0
> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon]
> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon]
> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper]
> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper]
> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon]
> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon]
> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>
> I couldn't reproduce it (but did hit it a number of times over the
> last two weeks), so have been analysing with what we have. The kernel
> the backtrace is against isn't built with debug, so:
>
> radeon_pm_set_clocks():
>   8ca80:       55                      push   %rbp
>   8ca81:       48 89 e5                mov    %rsp,%rbp
>   8ca84:       41 57                   push   %r15
>   8ca86:       41 56                   push   %r14
>   8ca88:       41 55                   push   %r13
>   8ca8a:       41 54                   push   %r12
>   8ca8c:       53                      push   %rbx
>   8ca8d:       48 83 ec 48             sub    $0x48,%rsp
>   8ca91:       e8 00 00 00 00          callq  8ca96 <radeon_pm_set_clocks+0x16>
>   8ca96:       49 89 fc                mov    %rdi,%r12
>   8ca99:       48 8b 7f 08             mov    0x8(%rdi),%rdi
>   8ca9d:       48 83 c7 20             add    $0x20,%rdi
>   8caa1:       e8 00 00 00 00          callq  8caa6 <radeon_pm_set_clocks+0x26>
>   8caa6:       49 8d 84 24 a0 1c 00    lea    0x1ca0(%r12),%rax
>   8caad:       00
>   8caae:       48 89 c7                mov    %rax,%rdi
>   8cab1:       48 89 45 90             mov    %rax,-0x70(%rbp)
>   8cab5:       e8 00 00 00 00          callq  8caba <radeon_pm_set_clocks+0x3a>
> -> 8caba:       49 8d 8c 24 e8 09 00    lea    0x9e8(%r12),%rcx
>   8cac1:       00
>   8cac2:       48 89 cf                mov    %rcx,%rdi
>   8cac5:       48 89 4d 98             mov    %rcx,-0x68(%rbp)
>   8cac9:       e8 00 00 00 00          callq  8cace <radeon_pm_set_clocks+0x4e>
>   8cace:       41 83 7c 24 78 1a       cmpl   $0x1a,0x78(%r12)
>   8cad4:       0f 86 46 01 00 00       jbe    8cc20
> <radeon_pm_set_clocks+0x1a0>
>   8cada:       41 80 bc 24 d0 0d 00    cmpb   $0x0,0xdd0(%r12)
>   8cae1:       00 00
>
> This corresponds to:
>
> static void radeon_pm_set_clocks(struct radeon_device *rdev)
> {
>        int i;
>
>        mutex_lock(&rdev->ddev->struct_mutex);
>        mutex_lock(&rdev->vram_mutex);
> ->      mutex_lock(&rdev->cp.mutex);
>
> We know r12 is the rdev struct pointer (%rdi is the first function
> argument gets loaded into r12), and 0x9e8 is plausable as the offset
> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so
> the code is trying to acquire cp.mutex it seems.
>
> From the kernel log, the command proc ring test completed after the
> start of the GPU reset, which is the last callsite to unlock cp.mutex,
> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit>
> before the hang at
> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.
>
> Looking through all the calls between these two spots, I just can't
> see anything that _could_ have locked cp.mutex, unless I'm misreading
> the disassembly of mutex_lock.
>
> Any thoughts?

Well, looking further back down the call chain, radeon_bo_create takes
the vram_mutex across the ttm_bo_init call, so I was duped by the
compiler's instruction scheduling.

I looked at various changes to the locking scheme which did address
the X hang (thus blackouts), but all attempts failed with lockdep
violations somewhere, so there is no minimal fix without revisiting
recent locking scheme changes [1,2,3].

I'll pass on some other minor fixes I've cooked up in the meantime.
Note that I am running with the kernel boot options 'dynpm=1
radeon.dynclks=1' to stop everything getting toasty; let me know if
you need more info for this issue.

Thanks,
  Daniel

[1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917
[2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923
[3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-09 13:41       ` Daniel J Blueman
@ 2010-08-17 22:12         ` Daniel J Blueman
  2010-08-17 22:37           ` Alex Deucher
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel J Blueman @ 2010-08-17 22:12 UTC (permalink / raw)
  To: Alex Deucher, Jerome Glisse, Matthew Garrett; +Cc: dri-devel

Hi chaps,

On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
> On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote:
>>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>>>>
>>>>> After around 10-30 mins of typical usage (ie cairo rendering to
>>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>>>>> locks up, giving a blank screen.
>>>>>
>>>>> The only information I have is a backtrace [1]. 'radeontool regs'
>>>>> doesn't give anything - what information, /sys files etc would be
>>>>> useful here, and would it be useful to log a freedesktop.org bug
>>>>> report at this stage, assuming I haven't found a reliable reproducer?
>>>>>
>>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>>>>
>>>>> Thanks,
>>>>>   Daniel
>>>>>
>>>>> --- [1]
>>>>>
>>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>>>>> ------------[ cut here ]------------
>>>>> WARNING: at
>>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>>>>> Hardware name: Studio 1557
>>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>>>>> libcrc32c
>>>>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>>>>> Call Trace:
>>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>>>> warn_slowpath_common+0x90/0xc0
>>>>> warn_slowpath_fmt+0x6e/0x70
>>>>> ? schedule_timeout+0x15a/0x2e0
>>>>> ? r600_irq_set+0x27d/0xc00 [radeon]
>>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]
>>>>> ? autoremove_wake_function+0x0/0x40
>>>>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>>> ttm_bo_wait+0x102/0x1b0 [ttm]
>>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>>> radeon_move_blit+0x124/0x170 [radeon]
>>>>> radeon_bo_move+0xda/0x1a0 [radeon]
>>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>>>>> ttm_bo_evict+0x1cc/0x270 [ttm]
>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>>> ttm_mem_evict_first+0xed/0x180 [ttm]
>>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>>>>> ttm_bo_mem_space+0x305/0x370 [ttm]
>>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>>> ttm_bo_validate+0xd5/0x100 [ttm]
>>>>> ttm_bo_init+0x1f2/0x240 [ttm]
>>>>> radeon_bo_create+0x121/0x240 [radeon]
>>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>>>>> radeon_gem_object_create+0x89/0x100 [radeon]
>>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>>>>> drm_ioctl+0x283/0x460 [drm]
>>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>>>>> ? do_readv_writev+0x16b/0x1e0
>>>>> vfs_ioctl+0x3a/0xc0
>>>>> do_vfs_ioctl+0x6d/0x1f0
>>>>> sys_ioctl+0x87/0xa0
>>>>> ? sys_writev+0x55/0xb0
>>>>> system_call_fastpath+0x16/0x1b
>>>>> ---[ end trace c0bc12025fa8386c ]---
>>>>> radeon 0000:02:00.0: GPU softreset
>>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>>> radeon 0000:02:00.0: GPU reset succeed
>>>>> Clocks initialized !
>>>>> ring test succeeded in 1 usecs
>>>>> ib test succeeded in 1 usecs
>>>>
>>>> SO it keep reseting ? According to log there is a GPU lockup but then
>>>> a successfull GPU reset so GPU should resume fine (that's what log says)
>>>> Best is to open a bug and attach full dmesg + lspci -vv and context
>>>> in which the lockup happen
>>>
>>> Perhaps the lockup timeout is too short or interrupts are being
>>> delivered late or something?  There's a fdo or kernel bugzilla entry
>>> where removing the lockup timeout fixed the issue.  Does something
>>> like this:
>>>
>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>>> @@ -237,9 +237,11 @@ retry:
>>>                         * as signaled for now
>>>                         */
>>>                        rdev->gpu_lockup = true;
>>> +#if 0
>>>                        r = radeon_gpu_reset(rdev);
>>>                        if (r)
>>>                                return r;
>>> +#endif
>>>                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
>>>                        rdev->gpu_lockup = false;
>>>                }
>>>
>>> fix the issue?
>>
>> Correction, RV710.
>>
>> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already
>> excruciatingly long for the command processor to not have responded,
>> but the issue I'm really hitting here is that X if left hanging after
>> GPU reset. The GPU lockup was being reported at:
>>
>> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon]
>> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0
>> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70
>> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0
>> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon]
>> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon]
>> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon]
>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>>
>> X consistently hangs after GPU reset at (with sysrq-T):
>>
>> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
>> [<ffffffff8157647b>] mutex_lock+0x2b/0x50
>> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon]
>> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40
>> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm]
>> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0
>> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon]
>> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon]
>> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper]
>> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper]
>> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon]
>> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon]
>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>>
>> I couldn't reproduce it (but did hit it a number of times over the
>> last two weeks), so have been analysing with what we have. The kernel
>> the backtrace is against isn't built with debug, so:
>>
>> radeon_pm_set_clocks():
>>   8ca80:       55                      push   %rbp
>>   8ca81:       48 89 e5                mov    %rsp,%rbp
>>   8ca84:       41 57                   push   %r15
>>   8ca86:       41 56                   push   %r14
>>   8ca88:       41 55                   push   %r13
>>   8ca8a:       41 54                   push   %r12
>>   8ca8c:       53                      push   %rbx
>>   8ca8d:       48 83 ec 48             sub    $0x48,%rsp
>>   8ca91:       e8 00 00 00 00          callq  8ca96 <radeon_pm_set_clocks+0x16>
>>   8ca96:       49 89 fc                mov    %rdi,%r12
>>   8ca99:       48 8b 7f 08             mov    0x8(%rdi),%rdi
>>   8ca9d:       48 83 c7 20             add    $0x20,%rdi
>>   8caa1:       e8 00 00 00 00          callq  8caa6 <radeon_pm_set_clocks+0x26>
>>   8caa6:       49 8d 84 24 a0 1c 00    lea    0x1ca0(%r12),%rax
>>   8caad:       00
>>   8caae:       48 89 c7                mov    %rax,%rdi
>>   8cab1:       48 89 45 90             mov    %rax,-0x70(%rbp)
>>   8cab5:       e8 00 00 00 00          callq  8caba <radeon_pm_set_clocks+0x3a>
>> -> 8caba:       49 8d 8c 24 e8 09 00    lea    0x9e8(%r12),%rcx
>>   8cac1:       00
>>   8cac2:       48 89 cf                mov    %rcx,%rdi
>>   8cac5:       48 89 4d 98             mov    %rcx,-0x68(%rbp)
>>   8cac9:       e8 00 00 00 00          callq  8cace <radeon_pm_set_clocks+0x4e>
>>   8cace:       41 83 7c 24 78 1a       cmpl   $0x1a,0x78(%r12)
>>   8cad4:       0f 86 46 01 00 00       jbe    8cc20
>> <radeon_pm_set_clocks+0x1a0>
>>   8cada:       41 80 bc 24 d0 0d 00    cmpb   $0x0,0xdd0(%r12)
>>   8cae1:       00 00
>>
>> This corresponds to:
>>
>> static void radeon_pm_set_clocks(struct radeon_device *rdev)
>> {
>>        int i;
>>
>>        mutex_lock(&rdev->ddev->struct_mutex);
>>        mutex_lock(&rdev->vram_mutex);
>> ->      mutex_lock(&rdev->cp.mutex);
>>
>> We know r12 is the rdev struct pointer (%rdi is the first function
>> argument gets loaded into r12), and 0x9e8 is plausable as the offset
>> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so
>> the code is trying to acquire cp.mutex it seems.
>>
>> From the kernel log, the command proc ring test completed after the
>> start of the GPU reset, which is the last callsite to unlock cp.mutex,
>> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit>
>> before the hang at
>> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.
>>
>> Looking through all the calls between these two spots, I just can't
>> see anything that _could_ have locked cp.mutex, unless I'm misreading
>> the disassembly of mutex_lock.
>>
>> Any thoughts?
>
> Well, looking further back down the call chain, radeon_bo_create takes
> the vram_mutex across the ttm_bo_init call, so I was duped by the
> compiler's instruction scheduling.
>
> I looked at various changes to the locking scheme which did address
> the X hang (thus blackouts), but all attempts failed with lockdep
> violations somewhere, so there is no minimal fix without revisiting
> recent locking scheme changes [1,2,3].
>
> I'll pass on some other minor fixes I've cooked up in the meantime.
> Note that I am running with the kernel boot options 'dynpm=1
> radeon.dynclks=1' to stop everything getting toasty; let me know if
> you need more info for this issue.
>
> Thanks,
>  Daniel
>
> [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917
> [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923
> [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937

I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms
timeout and subsequent hang every time on my RV710 when starting a
game on doom3 (eg
http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even
without the dympm/dynclks options.

It looks like the locking strategy of the Matthew's patches needs
reworking. I was able to reproduce the hang by forcing a GPU reset
also.

What else would help with this?

Thanks,
  Daniel
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-17 22:12         ` Daniel J Blueman
@ 2010-08-17 22:37           ` Alex Deucher
  2010-08-23 14:26             ` Daniel J Blueman
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Deucher @ 2010-08-17 22:37 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: dri-devel, Matthew Garrett

On Tue, Aug 17, 2010 at 6:12 PM, Daniel J Blueman
<daniel.blueman@gmail.com> wrote:
> Hi chaps,
>
> On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>> On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>>> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>>>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote:
>>>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>>>>>
>>>>>> After around 10-30 mins of typical usage (ie cairo rendering to
>>>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>>>>>> locks up, giving a blank screen.
>>>>>>
>>>>>> The only information I have is a backtrace [1]. 'radeontool regs'
>>>>>> doesn't give anything - what information, /sys files etc would be
>>>>>> useful here, and would it be useful to log a freedesktop.org bug
>>>>>> report at this stage, assuming I haven't found a reliable reproducer?
>>>>>>
>>>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>>>>>
>>>>>> Thanks,
>>>>>>   Daniel
>>>>>>
>>>>>> --- [1]
>>>>>>
>>>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>>>>>> ------------[ cut here ]------------
>>>>>> WARNING: at
>>>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>>>>>> Hardware name: Studio 1557
>>>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>>>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>>>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>>>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>>>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>>>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>>>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>>>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>>>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>>>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>>>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>>>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>>>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>>>>>> libcrc32c
>>>>>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>>>>>> Call Trace:
>>>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>>>>> warn_slowpath_common+0x90/0xc0
>>>>>> warn_slowpath_fmt+0x6e/0x70
>>>>>> ? schedule_timeout+0x15a/0x2e0
>>>>>> ? r600_irq_set+0x27d/0xc00 [radeon]
>>>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]
>>>>>> ? autoremove_wake_function+0x0/0x40
>>>>>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>>>> ttm_bo_wait+0x102/0x1b0 [ttm]
>>>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>>>> radeon_move_blit+0x124/0x170 [radeon]
>>>>>> radeon_bo_move+0xda/0x1a0 [radeon]
>>>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>>>>>> ttm_bo_evict+0x1cc/0x270 [ttm]
>>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>>>> ttm_mem_evict_first+0xed/0x180 [ttm]
>>>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>>>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>>>>>> ttm_bo_mem_space+0x305/0x370 [ttm]
>>>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>>>> ttm_bo_validate+0xd5/0x100 [ttm]
>>>>>> ttm_bo_init+0x1f2/0x240 [ttm]
>>>>>> radeon_bo_create+0x121/0x240 [radeon]
>>>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>>>>>> radeon_gem_object_create+0x89/0x100 [radeon]
>>>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>>>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>>>>>> drm_ioctl+0x283/0x460 [drm]
>>>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>>>>>> ? do_readv_writev+0x16b/0x1e0
>>>>>> vfs_ioctl+0x3a/0xc0
>>>>>> do_vfs_ioctl+0x6d/0x1f0
>>>>>> sys_ioctl+0x87/0xa0
>>>>>> ? sys_writev+0x55/0xb0
>>>>>> system_call_fastpath+0x16/0x1b
>>>>>> ---[ end trace c0bc12025fa8386c ]---
>>>>>> radeon 0000:02:00.0: GPU softreset
>>>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>>>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>>>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>>>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>>>> radeon 0000:02:00.0: GPU reset succeed
>>>>>> Clocks initialized !
>>>>>> ring test succeeded in 1 usecs
>>>>>> ib test succeeded in 1 usecs
>>>>>
>>>>> SO it keep reseting ? According to log there is a GPU lockup but then
>>>>> a successfull GPU reset so GPU should resume fine (that's what log says)
>>>>> Best is to open a bug and attach full dmesg + lspci -vv and context
>>>>> in which the lockup happen
>>>>
>>>> Perhaps the lockup timeout is too short or interrupts are being
>>>> delivered late or something?  There's a fdo or kernel bugzilla entry
>>>> where removing the lockup timeout fixed the issue.  Does something
>>>> like this:
>>>>
>>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>>>> @@ -237,9 +237,11 @@ retry:
>>>>                         * as signaled for now
>>>>                         */
>>>>                        rdev->gpu_lockup = true;
>>>> +#if 0
>>>>                        r = radeon_gpu_reset(rdev);
>>>>                        if (r)
>>>>                                return r;
>>>> +#endif
>>>>                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
>>>>                        rdev->gpu_lockup = false;
>>>>                }
>>>>
>>>> fix the issue?
>>>
>>> Correction, RV710.
>>>
>>> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already
>>> excruciatingly long for the command processor to not have responded,
>>> but the issue I'm really hitting here is that X if left hanging after
>>> GPU reset. The GPU lockup was being reported at:
>>>
>>> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0
>>> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70
>>> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0
>>> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon]
>>> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon]
>>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
>>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
>>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
>>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>>>
>>> X consistently hangs after GPU reset at (with sysrq-T):
>>>
>>> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
>>> [<ffffffff8157647b>] mutex_lock+0x2b/0x50
>>> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon]
>>> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40
>>> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm]
>>> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0
>>> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon]
>>> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon]
>>> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper]
>>> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper]
>>> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon]
>>> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon]
>>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
>>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
>>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
>>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>>>
>>> I couldn't reproduce it (but did hit it a number of times over the
>>> last two weeks), so have been analysing with what we have. The kernel
>>> the backtrace is against isn't built with debug, so:
>>>
>>> radeon_pm_set_clocks():
>>>   8ca80:       55                      push   %rbp
>>>   8ca81:       48 89 e5                mov    %rsp,%rbp
>>>   8ca84:       41 57                   push   %r15
>>>   8ca86:       41 56                   push   %r14
>>>   8ca88:       41 55                   push   %r13
>>>   8ca8a:       41 54                   push   %r12
>>>   8ca8c:       53                      push   %rbx
>>>   8ca8d:       48 83 ec 48             sub    $0x48,%rsp
>>>   8ca91:       e8 00 00 00 00          callq  8ca96 <radeon_pm_set_clocks+0x16>
>>>   8ca96:       49 89 fc                mov    %rdi,%r12
>>>   8ca99:       48 8b 7f 08             mov    0x8(%rdi),%rdi
>>>   8ca9d:       48 83 c7 20             add    $0x20,%rdi
>>>   8caa1:       e8 00 00 00 00          callq  8caa6 <radeon_pm_set_clocks+0x26>
>>>   8caa6:       49 8d 84 24 a0 1c 00    lea    0x1ca0(%r12),%rax
>>>   8caad:       00
>>>   8caae:       48 89 c7                mov    %rax,%rdi
>>>   8cab1:       48 89 45 90             mov    %rax,-0x70(%rbp)
>>>   8cab5:       e8 00 00 00 00          callq  8caba <radeon_pm_set_clocks+0x3a>
>>> -> 8caba:       49 8d 8c 24 e8 09 00    lea    0x9e8(%r12),%rcx
>>>   8cac1:       00
>>>   8cac2:       48 89 cf                mov    %rcx,%rdi
>>>   8cac5:       48 89 4d 98             mov    %rcx,-0x68(%rbp)
>>>   8cac9:       e8 00 00 00 00          callq  8cace <radeon_pm_set_clocks+0x4e>
>>>   8cace:       41 83 7c 24 78 1a       cmpl   $0x1a,0x78(%r12)
>>>   8cad4:       0f 86 46 01 00 00       jbe    8cc20
>>> <radeon_pm_set_clocks+0x1a0>
>>>   8cada:       41 80 bc 24 d0 0d 00    cmpb   $0x0,0xdd0(%r12)
>>>   8cae1:       00 00
>>>
>>> This corresponds to:
>>>
>>> static void radeon_pm_set_clocks(struct radeon_device *rdev)
>>> {
>>>        int i;
>>>
>>>        mutex_lock(&rdev->ddev->struct_mutex);
>>>        mutex_lock(&rdev->vram_mutex);
>>> ->      mutex_lock(&rdev->cp.mutex);
>>>
>>> We know r12 is the rdev struct pointer (%rdi is the first function
>>> argument gets loaded into r12), and 0x9e8 is plausable as the offset
>>> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so
>>> the code is trying to acquire cp.mutex it seems.
>>>
>>> From the kernel log, the command proc ring test completed after the
>>> start of the GPU reset, which is the last callsite to unlock cp.mutex,
>>> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit>
>>> before the hang at
>>> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.
>>>
>>> Looking through all the calls between these two spots, I just can't
>>> see anything that _could_ have locked cp.mutex, unless I'm misreading
>>> the disassembly of mutex_lock.
>>>
>>> Any thoughts?
>>
>> Well, looking further back down the call chain, radeon_bo_create takes
>> the vram_mutex across the ttm_bo_init call, so I was duped by the
>> compiler's instruction scheduling.
>>
>> I looked at various changes to the locking scheme which did address
>> the X hang (thus blackouts), but all attempts failed with lockdep
>> violations somewhere, so there is no minimal fix without revisiting
>> recent locking scheme changes [1,2,3].
>>
>> I'll pass on some other minor fixes I've cooked up in the meantime.
>> Note that I am running with the kernel boot options 'dynpm=1
>> radeon.dynclks=1' to stop everything getting toasty; let me know if
>> you need more info for this issue.
>>
>> Thanks,
>>  Daniel
>>
>> [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917
>> [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923
>> [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937
>
> I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms
> timeout and subsequent hang every time on my RV710 when starting a
> game on doom3 (eg
> http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even
> without the dympm/dynclks options.
>
> It looks like the locking strategy of the Matthew's patches needs
> reworking. I was able to reproduce the hang by forcing a GPU reset
> also.
>
> What else would help with this?

Does this patch help?

http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=d02a83bb2dd3e34408844961336a66310a35ba30

It fixes several other similar bugs.

Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: GPU lockup+blackout on 2.6.35 with r600...
  2010-08-17 22:37           ` Alex Deucher
@ 2010-08-23 14:26             ` Daniel J Blueman
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel J Blueman @ 2010-08-23 14:26 UTC (permalink / raw)
  To: Alex Deucher; +Cc: dri-devel, Matthew Garrett

On 17 August 2010 23:37, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Tue, Aug 17, 2010 at 6:12 PM, Daniel J Blueman
> <daniel.blueman@gmail.com> wrote:
>> Hi chaps,
>>
>> On 9 August 2010 14:41, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>>> On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>>>> On 5 August 2010 16:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>>>>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse@freedesktop.org> wrote:
>>>>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>>>>>>
>>>>>>> After around 10-30 mins of typical usage (ie cairo rendering to
>>>>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>>>>>>> locks up, giving a blank screen.
>>>>>>>
>>>>>>> The only information I have is a backtrace [1]. 'radeontool regs'
>>>>>>> doesn't give anything - what information, /sys files etc would be
>>>>>>> useful here, and would it be useful to log a freedesktop.org bug
>>>>>>> report at this stage, assuming I haven't found a reliable reproducer?
>>>>>>>
>>>>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>   Daniel
>>>>>>>
>>>>>>> --- [1]
>>>>>>>
>>>>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>>>>>>> ------------[ cut here ]------------
>>>>>>> WARNING: at
>>>>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>>>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>>>>>>> Hardware name: Studio 1557
>>>>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>>>>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>>>>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>>>>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>>>>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>>>>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>>>>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>>>>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>>>>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>>>>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>>>>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>>>>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>>>>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>>>>>>> libcrc32c
>>>>>>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>>>>>>> Call Trace:
>>>>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>>>>>> warn_slowpath_common+0x90/0xc0
>>>>>>> warn_slowpath_fmt+0x6e/0x70
>>>>>>> ? schedule_timeout+0x15a/0x2e0
>>>>>>> ? r600_irq_set+0x27d/0xc00 [radeon]
>>>>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>>>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>>>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>>>>>> radeon_fence_wait+0x2b7/0x320 [radeon]
>>>>>>> ? autoremove_wake_function+0x0/0x40
>>>>>>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>>>>> ttm_bo_wait+0x102/0x1b0 [ttm]
>>>>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>>>>> radeon_move_blit+0x124/0x170 [radeon]
>>>>>>> radeon_bo_move+0xda/0x1a0 [radeon]
>>>>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>>>>>>> ttm_bo_evict+0x1cc/0x270 [ttm]
>>>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>>>>> ttm_mem_evict_first+0xed/0x180 [ttm]
>>>>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>>>>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>>>>>>> ttm_bo_mem_space+0x305/0x370 [ttm]
>>>>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>>>>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>>>>> ttm_bo_validate+0xd5/0x100 [ttm]
>>>>>>> ttm_bo_init+0x1f2/0x240 [ttm]
>>>>>>> radeon_bo_create+0x121/0x240 [radeon]
>>>>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>>>>>>> radeon_gem_object_create+0x89/0x100 [radeon]
>>>>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>>>>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>>>>>>> drm_ioctl+0x283/0x460 [drm]
>>>>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>>>>>>> ? do_readv_writev+0x16b/0x1e0
>>>>>>> vfs_ioctl+0x3a/0xc0
>>>>>>> do_vfs_ioctl+0x6d/0x1f0
>>>>>>> sys_ioctl+0x87/0xa0
>>>>>>> ? sys_writev+0x55/0xb0
>>>>>>> system_call_fastpath+0x16/0x1b
>>>>>>> ---[ end trace c0bc12025fa8386c ]---
>>>>>>> radeon 0000:02:00.0: GPU softreset
>>>>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>>>>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>>>>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>>>>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>>>>> radeon 0000:02:00.0: GPU reset succeed
>>>>>>> Clocks initialized !
>>>>>>> ring test succeeded in 1 usecs
>>>>>>> ib test succeeded in 1 usecs
>>>>>>
>>>>>> SO it keep reseting ? According to log there is a GPU lockup but then
>>>>>> a successfull GPU reset so GPU should resume fine (that's what log says)
>>>>>> Best is to open a bug and attach full dmesg + lspci -vv and context
>>>>>> in which the lockup happen
>>>>>
>>>>> Perhaps the lockup timeout is too short or interrupts are being
>>>>> delivered late or something?  There's a fdo or kernel bugzilla entry
>>>>> where removing the lockup timeout fixed the issue.  Does something
>>>>> like this:
>>>>>
>>>>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>>>>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>>>>> @@ -237,9 +237,11 @@ retry:
>>>>>                         * as signaled for now
>>>>>                         */
>>>>>                        rdev->gpu_lockup = true;
>>>>> +#if 0
>>>>>                        r = radeon_gpu_reset(rdev);
>>>>>                        if (r)
>>>>>                                return r;
>>>>> +#endif
>>>>>                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
>>>>>                        rdev->gpu_lockup = false;
>>>>>                }
>>>>>
>>>>> fix the issue?
>>>>
>>>> Correction, RV710.
>>>>
>>>> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already
>>>> excruciatingly long for the command processor to not have responded,
>>>> but the issue I'm really hitting here is that X if left hanging after
>>>> GPU reset. The GPU lockup was being reported at:
>>>>
>>>> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>>> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0
>>>> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70
>>>> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0
>>>> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon]
>>>> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>>> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>>> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>>> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon]
>>>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
>>>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
>>>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>>>>
>>>> X consistently hangs after GPU reset at (with sysrq-T):
>>>>
>>>> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
>>>> [<ffffffff8157647b>] mutex_lock+0x2b/0x50
>>>> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon]
>>>> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40
>>>> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm]
>>>> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0
>>>> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon]
>>>> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon]
>>>> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper]
>>>> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper]
>>>> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon]
>>>> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon]
>>>> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
>>>> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
>>>> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>>>>
>>>> I couldn't reproduce it (but did hit it a number of times over the
>>>> last two weeks), so have been analysing with what we have. The kernel
>>>> the backtrace is against isn't built with debug, so:
>>>>
>>>> radeon_pm_set_clocks():
>>>>   8ca80:       55                      push   %rbp
>>>>   8ca81:       48 89 e5                mov    %rsp,%rbp
>>>>   8ca84:       41 57                   push   %r15
>>>>   8ca86:       41 56                   push   %r14
>>>>   8ca88:       41 55                   push   %r13
>>>>   8ca8a:       41 54                   push   %r12
>>>>   8ca8c:       53                      push   %rbx
>>>>   8ca8d:       48 83 ec 48             sub    $0x48,%rsp
>>>>   8ca91:       e8 00 00 00 00          callq  8ca96 <radeon_pm_set_clocks+0x16>
>>>>   8ca96:       49 89 fc                mov    %rdi,%r12
>>>>   8ca99:       48 8b 7f 08             mov    0x8(%rdi),%rdi
>>>>   8ca9d:       48 83 c7 20             add    $0x20,%rdi
>>>>   8caa1:       e8 00 00 00 00          callq  8caa6 <radeon_pm_set_clocks+0x26>
>>>>   8caa6:       49 8d 84 24 a0 1c 00    lea    0x1ca0(%r12),%rax
>>>>   8caad:       00
>>>>   8caae:       48 89 c7                mov    %rax,%rdi
>>>>   8cab1:       48 89 45 90             mov    %rax,-0x70(%rbp)
>>>>   8cab5:       e8 00 00 00 00          callq  8caba <radeon_pm_set_clocks+0x3a>
>>>> -> 8caba:       49 8d 8c 24 e8 09 00    lea    0x9e8(%r12),%rcx
>>>>   8cac1:       00
>>>>   8cac2:       48 89 cf                mov    %rcx,%rdi
>>>>   8cac5:       48 89 4d 98             mov    %rcx,-0x68(%rbp)
>>>>   8cac9:       e8 00 00 00 00          callq  8cace <radeon_pm_set_clocks+0x4e>
>>>>   8cace:       41 83 7c 24 78 1a       cmpl   $0x1a,0x78(%r12)
>>>>   8cad4:       0f 86 46 01 00 00       jbe    8cc20
>>>> <radeon_pm_set_clocks+0x1a0>
>>>>   8cada:       41 80 bc 24 d0 0d 00    cmpb   $0x0,0xdd0(%r12)
>>>>   8cae1:       00 00
>>>>
>>>> This corresponds to:
>>>>
>>>> static void radeon_pm_set_clocks(struct radeon_device *rdev)
>>>> {
>>>>        int i;
>>>>
>>>>        mutex_lock(&rdev->ddev->struct_mutex);
>>>>        mutex_lock(&rdev->vram_mutex);
>>>> ->      mutex_lock(&rdev->cp.mutex);
>>>>
>>>> We know r12 is the rdev struct pointer (%rdi is the first function
>>>> argument gets loaded into r12), and 0x9e8 is plausable as the offset
>>>> of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so
>>>> the code is trying to acquire cp.mutex it seems.
>>>>
>>>> From the kernel log, the command proc ring test completed after the
>>>> start of the GPU reset, which is the last callsite to unlock cp.mutex,
>>>> ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit>
>>>> before the hang at
>>>> <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.
>>>>
>>>> Looking through all the calls between these two spots, I just can't
>>>> see anything that _could_ have locked cp.mutex, unless I'm misreading
>>>> the disassembly of mutex_lock.
>>>>
>>>> Any thoughts?
>>>
>>> Well, looking further back down the call chain, radeon_bo_create takes
>>> the vram_mutex across the ttm_bo_init call, so I was duped by the
>>> compiler's instruction scheduling.
>>>
>>> I looked at various changes to the locking scheme which did address
>>> the X hang (thus blackouts), but all attempts failed with lockdep
>>> violations somewhere, so there is no minimal fix without revisiting
>>> recent locking scheme changes [1,2,3].
>>>
>>> I'll pass on some other minor fixes I've cooked up in the meantime.
>>> Note that I am running with the kernel boot options 'dynpm=1
>>> radeon.dynclks=1' to stop everything getting toasty; let me know if
>>> you need more info for this issue.
>>>
>>> Thanks,
>>>  Daniel
>>>
>>> [1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917
>>> [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923
>>> [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937
>>
>> I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms
>> timeout and subsequent hang every time on my RV710 when starting a
>> game on doom3 (eg
>> http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even
>> without the dympm/dynclks options.
>>
>> It looks like the locking strategy of the Matthew's patches needs
>> reworking. I was able to reproduce the hang by forcing a GPU reset
>> also.
>>
>> What else would help with this?
>
> Does this patch help?
>
> http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h=d02a83bb2dd3e34408844961336a66310a35ba30
>
> It fixes several other similar bugs.

Yes, sorry for the delays replying. I still didn't get time to analyse
other call-paths this can affect, and thus the possibility of a
pending power mode change causing the same deadlock.

That aside, this patch does prevent the deadlocks I was able to
reproduce, including when the CS parser detects issues in
user-submitted CS packets.

Thanks,
  Daniel
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-08-23 14:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-08-05 10:52 GPU lockup+blackout on 2.6.35 with r600 Daniel J Blueman
2010-08-05 13:50 ` Jerome Glisse
2010-08-05 15:23   ` Alex Deucher
2010-08-08 16:36     ` Daniel J Blueman
2010-08-09 13:41       ` Daniel J Blueman
2010-08-17 22:12         ` Daniel J Blueman
2010-08-17 22:37           ` Alex Deucher
2010-08-23 14:26             ` Daniel J Blueman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.