public inbox for intel-gfx@lists.freedesktop.org
* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
       [not found] <20111024024822.GA5123@mindspring.com>
@ 2011-10-24  4:12 ` James R. Leu
  2011-10-24  6:46   ` Daniel Vetter
  0 siblings, 1 reply; 12+ messages in thread
From: James R. Leu @ 2011-10-24  4:12 UTC (permalink / raw)
  To: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 4353 bytes --]

Hello,

I'm running WoW in Wine on 64-bit Fedora Rawhide on a Dell Vostro 3550
(i5 with integrated GPU).

I'm reliably able to produce 2 types of crashes:
- WoW freezes, but I can still get to a text console. In this case I'm able to
  grab a kernel stack trace (below) prior to seeing the normal
  [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 452684 at 452608, next 452686)
- the other is a complete freeze of the system, hard reset required, nothing logged to /var/log/messages
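
For readers decoding the error line quoted above, here is a minimal sketch. It assumes the message layout is "returns <error code> (awaiting <seqno> at <current seqno>, next <next seqno>)". The helper name parse_wait_request and the "outstanding" field are illustrative only and are not part of the driver.

```python
import re

# Matches the i915_wait_request error line as it appears in the log above.
WAIT_RE = re.compile(
    r"i915_wait_request returns (-?\d+) "
    r"\(awaiting (\d+) at (\d+), next (\d+)\)"
)


def parse_wait_request(line):
    """Return the numeric fields of the error line as a dict, or None."""
    m = WAIT_RE.search(line)
    if m is None:
        return None
    ret, awaiting, current, nxt = (int(g) for g in m.groups())
    return {
        "ret": ret,              # negative error code (11 is EAGAIN on Linux)
        "awaiting": awaiting,    # sequence number being waited for
        "current": current,      # sequence number reached so far (assumed meaning)
        "next": nxt,             # next sequence number to be handed out (assumed meaning)
        "outstanding": awaiting - current,  # illustrative: distance still to go
    }


line = ("[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 "
        "(awaiting 452684 at 452608, next 452686)")
info = parse_wait_request(line)
print(info)
```

If that reading of the fields is right, the line says the wait gave up while the GPU was still 76 sequence numbers short of the awaited request, which is consistent with a stalled ring.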

Is there any value in my filing a bug report for this? It seems to be a pretty common issue.
Is there any use in my trying different kernel command line options for the i915 driver
or config options for the xorg intel driver?

I have the various git trees checked out (I was looking for recent changes that might be related
to this issue).  I'm capable of building and installing from these git trees if there are specific
bits that I should test.

Oct 22 20:52:59 localhost kernel: [  939.830806] ------------[ cut here ]------------
Oct 22 20:52:59 localhost kernel: [  939.830814] WARNING: at drivers/gpu/drm/i915/i915_drv.c:372 gen6_gt_force_wake_put+0x29/0x51 [i915]()
Oct 22 20:52:59 localhost kernel: [  939.830816] Hardware name: Vostro 3550
Oct 22 20:52:59 localhost kernel: [  939.830818] Modules linked in: snd_seq_dummy fuse ip6table_filter ip6_tables ebtable_nat ebtables xt_state xt_CHECKSUM iptable_mangle ppdev parport_pc lp parport vboxpci vboxnetadp vboxnetflt vboxdrv bridge stp llc tun rfcomm bnep ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_codec_idt uvcvideo videodev btusb media bluetooth v4l2_compat_ioctl32 arc4 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlagn microcode mac80211 dell_laptop iTCO_wdt r8169 i2c_i801 snd_timer cfg80211 snd mii iTCO_vendor_support dcdbas dell_wmi sparse_keymap soundcore rfkill snd_page_alloc virtio_net kvm_intel kvm binfmt_misc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Oct 22 20:52:59 localhost kernel: [  939.830926] Pid: 0, comm: swapper Tainted: G        WC  3.1.0-0.rc10.git0.1.fc17.x86_64 #1
Oct 22 20:52:59 localhost kernel: [  939.830928] Call Trace:
Oct 22 20:52:59 localhost kernel: [  939.830930]  <IRQ [<ffffffff8105c3a0>] warn_slowpath_common+0x83/0x9b
Oct 22 20:52:59 localhost kernel: [  939.830941]  [<ffffffff8105c3d2>] warn_slowpath_null+0x1a/0x1c
Oct 22 20:52:59 localhost kernel: [  939.830952]  [<ffffffffa006b624>] gen6_gt_force_wake_put+0x29/0x51 [i915]
Oct 22 20:52:59 localhost kernel: [  939.830963]  [<ffffffffa006f45f>] i915_read32+0x44/0x6b [i915]
Oct 22 20:52:59 localhost kernel: [  939.830975]  [<ffffffffa00724a9>] i915_hangcheck_elapsed+0xe8/0x1f8 [i915]
Oct 22 20:52:59 localhost kernel: [  939.831027]  [<ffffffff81062ddd>] irq_exit+0x5d/0xcf
Oct 22 20:52:59 localhost kernel: [  939.831032]  [<ffffffff8150de91>] smp_apic_timer_interrupt+0x7c/0x8a
Oct 22 20:52:59 localhost kernel: [  939.831036]  [<ffffffff8150bd73>] apic_timer_interrupt+0x73/0x80
Oct 22 20:52:59 localhost kernel: [  939.831038]  <EOI [<ffffffff81014ded>] ? paravirt_read_tsc+0x9/0xd
Oct 22 20:52:59 localhost kernel: [  939.831046]  [<ffffffff81297075>] ? intel_idle+0xe5/0x10c
Oct 22 20:52:59 localhost kernel: [  939.831050]  [<ffffffff81297071>] ? intel_idle+0xe1/0x10c
Oct 22 20:52:59 localhost kernel: [  939.831054]  [<ffffffff813e14fe>] cpuidle_idle_call+0x11c/0x1fe
Oct 22 20:52:59 localhost kernel: [  939.831059]  [<ffffffff8100e2ef>] cpu_idle+0xab/0x101
Oct 22 20:52:59 localhost kernel: [  939.831063]  [<ffffffff814df673>] rest_init+0xd7/0xde
Oct 22 20:52:59 localhost kernel: [  939.831067]  [<ffffffff814df59c>] ? csum_partial_copy_generic+0x16c/0x16c
Oct 22 20:52:59 localhost kernel: [  939.831072]  [<ffffffff81d53bb0>] start_kernel+0x3dd/0x3ea
Oct 22 20:52:59 localhost kernel: [  939.831076]  [<ffffffff81d532c4>] x86_64_start_reservations+0xaf/0xb3
Oct 22 20:52:59 localhost kernel: [  939.831081]  [<ffffffff81d53140>] ? early_idt_handlers+0x140/0x140
Oct 22 20:52:59 localhost kernel: [  939.831085]  [<ffffffff81d533ca>] x86_64_start_kernel+0x102/0x111
Oct 22 20:52:59 localhost kernel: [  939.831088] ---[ end trace f5cba358bac6b7e5 ]---

-- 
James R. Leu
jleu@mindspring.com

[-- Attachment #1.2: Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-24  4:12 ` James R. Leu
@ 2011-10-24  6:46   ` Daniel Vetter
  2011-10-25  0:58     ` James R. Leu
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2011-10-24  6:46 UTC (permalink / raw)
  To: James R. Leu; +Cc: intel-gfx

On Sun, Oct 23, 2011 at 11:12:21PM -0500, James R. Leu wrote:
> I'm running wow in wine on 64 bit fedora rawhide on a dell vostro 3550
> (i5 with integrated GPU).
> 
> I'm reliably able to produce 2 types of crashes:
> - wow freezes, but I can get to text console, in this case I'm able to
>   grab a kernel stack trace  (below) prior to seeing the normal
>   [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 452684 at 452608, next 452686)

I'm pretty sure that below that line there's a gpu hang report. If that's
the case, then please grab everything in /sys/kernel/debug/dri, put it into
a tar.gz and attach it (you need to do this _after_ the machine is hung;
the kernel will write a gpu crash dump into i915_error_state).

The userspace parts of the i915 driver are very important for gpu hangs,
so please attach the version of mesa, libdrm and xf86-video-intel you've
installed.

Also please attach all your i915.ko module options as listed in
/sys/module/i915/parameters
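
The collection steps requested above could be scripted roughly as follows. This is a sketch, not an official tool: archive_tree and read_module_params are hypothetical helper names, and it assumes debugfs is mounted at /sys/kernel/debug and that the script runs as root. It reads each file explicitly rather than trusting the reported file size, on the assumption that debugfs pseudo-files may report a size of zero.

```python
import io
import os
import tarfile


def archive_tree(src_dir, out_path):
    """Copy every readable file under src_dir into a gzip-compressed tarball.

    out_path should live outside src_dir. Returns the archived member names.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    with open(path, "rb") as fh:
                        data = fh.read()
                except OSError:
                    continue  # skip entries that cannot be read
                # Size comes from the bytes actually read, not from stat().
                info = tarfile.TarInfo(os.path.relpath(path, src_dir))
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                added.append(info.name)
    return added


def read_module_params(param_dir):
    """Map each parameter file name in param_dir to its stripped contents."""
    params = {}
    for name in sorted(os.listdir(param_dir)):
        path = os.path.join(param_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "r") as fh:
                params[name] = fh.read().strip()
        except OSError as exc:
            params[name] = "<unreadable: %s>" % exc
    return params
```

On the hung machine one would call archive_tree("/sys/kernel/debug/dri", "/tmp/dri-debug.tar.gz") and read_module_params("/sys/module/i915/parameters"); the output path here is only an example. The mesa, libdrm and xf86-video-intel versions still have to be gathered separately with the distribution's package manager.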

> - the other is a complete freeze of the system, hard reset required, nothing logged to /var/log/messages

It's rather likely that this is the same issue as above. Depending upon
exact circumstances the gpu can take down the entire system.

> Is there any value in me creating a bug report for this, it seems to be a pretty common issue.
> Is there any use in my trying different kernel command line optios for
> the i915 driver or config options to the xorg intel driver?

Yes, gpu hangs are one of the more common issues, but until you've
submitted the error_state there's no way to diagnose the issue and tell
whether we have got a report already.

> I have the various git trees pulled out (I was looking for recent changes that might be related
> to this issue).  I'm capable of building and installing from these git trees if there are specific
> bits that I should test.
> 
> [  939.830806] ------------[ cut here ]------------
> [  939.830814] WARNING: at drivers/gpu/drm/i915/i915_drv.c:372 gen6_gt_force_wake_put+0x29/0x51 [i915]()
> [  939.830816] Hardware name: Vostro 3550
> [  939.830818] Modules linked in: snd_seq_dummy fuse ip6table_filter ip6_tables ebtable_nat ebtables xt_state xt_CHECKSUM iptable_mangle ppdev parport_pc lp parport vboxpci vboxnetadp vboxnetflt vboxdrv bridge stp llc tun rfcomm bnep ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_codec_idt uvcvideo videodev btusb media bluetooth v4l2_compat_ioctl32 arc4 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlagn microcode mac80211 dell_laptop iTCO_wdt r8169 i2c_i801 snd_timer cfg80211 snd mii iTCO_vendor_support dcdbas dell_wmi sparse_keymap soundcore rfkill snd_page_alloc virtio_net kvm_intel kvm binfmt_misc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
> [  939.830926] Pid: 0, comm: swapper Tainted: G        WC  3.1.0-0.rc10.git0.1.fc17.x86_64 #1
> [  939.830928] Call Trace:
> [  939.830930]  <IRQ [<ffffffff8105c3a0>] warn_slowpath_common+0x83/0x9b
> [  939.830941]  [<ffffffff8105c3d2>] warn_slowpath_null+0x1a/0x1c
> [  939.830952]  [<ffffffffa006b624>] gen6_gt_force_wake_put+0x29/0x51 [i915]
> [  939.830963]  [<ffffffffa006f45f>] i915_read32+0x44/0x6b [i915]
> [  939.830975]  [<ffffffffa00724a9>] i915_hangcheck_elapsed+0xe8/0x1f8 [i915]
> [  939.831027]  [<ffffffff81062ddd>] irq_exit+0x5d/0xcf
> [  939.831032]  [<ffffffff8150de91>] smp_apic_timer_interrupt+0x7c/0x8a
> [  939.831036]  [<ffffffff8150bd73>] apic_timer_interrupt+0x73/0x80
> [  939.831038]  <EOI [<ffffffff81014ded>] ? paravirt_read_tsc+0x9/0xd
> [  939.831046]  [<ffffffff81297075>] ? intel_idle+0xe5/0x10c
> [  939.831050]  [<ffffffff81297071>] ? intel_idle+0xe1/0x10c
> [  939.831054]  [<ffffffff813e14fe>] cpuidle_idle_call+0x11c/0x1fe
> [  939.831059]  [<ffffffff8100e2ef>] cpu_idle+0xab/0x101
> [  939.831063]  [<ffffffff814df673>] rest_init+0xd7/0xde
> [  939.831067]  [<ffffffff814df59c>] ? csum_partial_copy_generic+0x16c/0x16c
> [  939.831072]  [<ffffffff81d53bb0>] start_kernel+0x3dd/0x3ea
> [  939.831076]  [<ffffffff81d532c4>] x86_64_start_reservations+0xaf/0xb3
> [  939.831081]  [<ffffffff81d53140>] ? early_idt_handlers+0x140/0x140
> [  939.831085]  [<ffffffff81d533ca>] x86_64_start_kernel+0x102/0x111
> [  939.831088] ---[ end trace f5cba358bac6b7e5 ]---

This WARN here is a possible side effect of a dying gpu. It is an independent,
but rather harmless, bug. Unfortunately there is no easy solution, hence no patch atm.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-24  6:46   ` Daniel Vetter
@ 2011-10-25  0:58     ` James R. Leu
  2011-10-25  2:43       ` Kenneth Graunke
  0 siblings, 1 reply; 12+ messages in thread
From: James R. Leu @ 2011-10-25  0:58 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx


[-- Attachment #1.1.1: Type: text/plain, Size: 4948 bytes --]

Debug output attached

On Mon, Oct 24, 2011 at 08:46:56AM +0200, Daniel Vetter wrote:
> On Sun, Oct 23, 2011 at 11:12:21PM -0500, James R. Leu wrote:
> > I'm running wow in wine on 64 bit fedora rawhide on a dell vostro 3550
> > (i5 with integrated GPU).
> > 
> > I'm reliably able to produce 2 types of crashes:
> > - wow freezes, but I can get to text console, in this case I'm able to
> >   grab a kernel stack trace  (below) prior to seeing the normal
> >   [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 452684 at 452608, next 452686)
> 
> I'm pretty sure that below that line there's a gpu hang report. If that's
> the case, the please grab everything in /sys/kernel/debug/dri, put it into
> a tar.gz and attach it (you need to do this _after_ the machine is hung,
> the kernel will write a gpu crash dump into i915_error_state).
> 
> The userspace parts of the i915 driver are very important for gpu hangs,
> so please attach the version of mesa, libdrm and xf86-video-intel you've
> installed.
> 
> Also please attach all your i915.ko module options as listed in
> /sys/module/i915/parameters
> 
> > - the other is a complete freeze of the system, hard reset required, nothing logged to /var/log/messages
> 
> It's rather likely that this is the same issue as above. Depending upon
> exact circumstances the gpu can take down the entire system.
> 
> > Is there any value in me creating a bug report for this, it seems to be a pretty common issue.
> > Is there any use in my trying different kernel command line optios for
> > the i915 driver or config options to the xorg intel driver?
> 
> Yes, gpu hangs are one of the more common issues, but until you've
> submitted the error_state there's no way to diagnose the issue and tell
> whether we have got a report already.
> 
> > I have the various git trees pulled out (I was looking for recent changes that might be related
> > to this issue).  I'm capable of building and installing from these git trees if there are specific
> > bits that I should test.
> > 
> > [  939.830806] ------------[ cut here ]------------
> > [  939.830814] WARNING: at drivers/gpu/drm/i915/i915_drv.c:372 gen6_gt_force_wake_put+0x29/0x51 [i915]()
> > [  939.830816] Hardware name: Vostro 3550
> > [  939.830818] Modules linked in: snd_seq_dummy fuse ip6table_filter ip6_tables ebtable_nat ebtables xt_state xt_CHECKSUM iptable_mangle ppdev parport_pc lp parport vboxpci vboxnetadp vboxnetflt vboxdrv bridge stp llc tun rfcomm bnep ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_codec_idt uvcvideo videodev btusb media bluetooth v4l2_compat_ioctl32 arc4 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlagn microcode mac80211 dell_laptop iTCO_wdt r8169 i2c_i801 snd_timer cfg80211 snd mii iTCO_vendor_support dcdbas dell_wmi sparse_keymap soundcore rfkill snd_page_alloc virtio_net kvm_intel kvm binfmt_misc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
> > [  939.830926] Pid: 0, comm: swapper Tainted: G        WC  3.1.0-0.rc10.git0.1.fc17.x86_64 #1
> > [  939.830928] Call Trace:
> > [  939.830930]  <IRQ [<ffffffff8105c3a0>] warn_slowpath_common+0x83/0x9b
> > [  939.830941]  [<ffffffff8105c3d2>] warn_slowpath_null+0x1a/0x1c
> > [  939.830952]  [<ffffffffa006b624>] gen6_gt_force_wake_put+0x29/0x51 [i915]
> > [  939.830963]  [<ffffffffa006f45f>] i915_read32+0x44/0x6b [i915]
> > [  939.830975]  [<ffffffffa00724a9>] i915_hangcheck_elapsed+0xe8/0x1f8 [i915]
> > [  939.831027]  [<ffffffff81062ddd>] irq_exit+0x5d/0xcf
> > [  939.831032]  [<ffffffff8150de91>] smp_apic_timer_interrupt+0x7c/0x8a
> > [  939.831036]  [<ffffffff8150bd73>] apic_timer_interrupt+0x73/0x80
> > [  939.831038]  <EOI [<ffffffff81014ded>] ? paravirt_read_tsc+0x9/0xd
> > [  939.831046]  [<ffffffff81297075>] ? intel_idle+0xe5/0x10c
> > [  939.831050]  [<ffffffff81297071>] ? intel_idle+0xe1/0x10c
> > [  939.831054]  [<ffffffff813e14fe>] cpuidle_idle_call+0x11c/0x1fe
> > [  939.831059]  [<ffffffff8100e2ef>] cpu_idle+0xab/0x101
> > [  939.831063]  [<ffffffff814df673>] rest_init+0xd7/0xde
> > [  939.831067]  [<ffffffff814df59c>] ? csum_partial_copy_generic+0x16c/0x16c
> > [  939.831072]  [<ffffffff81d53bb0>] start_kernel+0x3dd/0x3ea
> > [  939.831076]  [<ffffffff81d532c4>] x86_64_start_reservations+0xaf/0xb3
> > [  939.831081]  [<ffffffff81d53140>] ? early_idt_handlers+0x140/0x140
> > [  939.831085]  [<ffffffff81d533ca>] x86_64_start_kernel+0x102/0x111
> > [  939.831088] ---[ end trace f5cba358bac6b7e5 ]---
> 
> This WARN here is a possible sideeffect of a dying gpu. Independant, but
> rather harmless bug. Unfortunately no easy solution, hence no patch atm.
> 
> Yours, Daniel
> -- 
> Daniel Vetter
> Mail: daniel@ffwll.ch
> Mobile: +41 (0)79 365 57 48

-- 
James R. Leu
jleu@mindspring.com

[-- Attachment #1.1.2: 2542-debug.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 1019884 bytes --]

[-- Attachment #1.1.3: 2542-params.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 226 bytes --]

[-- Attachment #1.1.4: 2542-versions.txt.bz2 --]
[-- Type: application/x-bzip2, Size: 172 bytes --]

[-- Attachment #1.2: Type: application/pgp-signature, Size: 198 bytes --]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-25  0:58     ` James R. Leu
@ 2011-10-25  2:43       ` Kenneth Graunke
  2011-10-25  7:15         ` Jesse Barnes
  0 siblings, 1 reply; 12+ messages in thread
From: Kenneth Graunke @ 2011-10-25  2:43 UTC (permalink / raw)
  To: jleu; +Cc: intel-gfx

On 10/24/2011 05:58 PM, James R. Leu wrote:
> Debug output attached

You're in luck!  I fixed this GPU hang today in Mesa master.

This commit fixes the hang:

commit 3cc0a7be23ab603ed40d602595f673a44e079885
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Fri Oct 21 01:03:37 2011 -0700

    i965: Apply post-sync non-zero workaround to homebrew workaround.

    In commit 3e5d3626, Eric added a homebrew workaround to fix GPU hangs in
    the Mesa "engine" demo and oglc's api-texcoord test.

    Unfortunately, his PIPE_CONTROL contains a Depth Stall, which
    necessitates the post-sync non-zero workaround.

    Fixes GPU hangs in Civilization 4, PlaneShift, and 3DMMES.
    Hopefully Heroes of Newerth as well, though I haven't tested that.

    NOTE: This is candidate for the 7.11 branch.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40324
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=41096
    Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
    Reviewed-and-tested-by: Eric Anholt <eric@anholt.net>

I'm planning on cherry-picking it to the 7.11 branch in the next few
days, so it ought to make the upcoming 7.11.1 release.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-25  2:43       ` Kenneth Graunke
@ 2011-10-25  7:15         ` Jesse Barnes
  2011-10-25  7:49           ` Daniel Vetter
  0 siblings, 1 reply; 12+ messages in thread
From: Jesse Barnes @ 2011-10-25  7:15 UTC (permalink / raw)
  To: Kenneth Graunke; +Cc: intel-gfx

On Mon, 24 Oct 2011 19:43:44 -0700
Kenneth Graunke <kenneth@whitecape.org> wrote:

> On 10/24/2011 05:58 PM, James R. Leu wrote:
> > Debug output attached
> 
> You're in luck!  I fixed this GPU hang today in Mesa master.
> 
> This commit fixes the hang:
> 
> commit 3cc0a7be23ab603ed40d602595f673a44e079885
> Author: Kenneth Graunke <kenneth@whitecape.org>
> Date:   Fri Oct 21 01:03:37 2011 -0700
> 
>     i965: Apply post-sync non-zero workaround to homebrew workaround.
> 
>     In commit 3e5d3626, Eric added a homebrew workaround to fix GPU
> hangs in the Mesa "engine" demo and oglc's api-texcoord test.
> 
>     Unfortunately, his PIPE_CONTROL contains a Depth Stall, which
>     necessitates the post-sync non-zero workaround,
> 
>     Fixes GPU hangs in Civilization 4, PlaneShift, and 3DMMES.
>     Hopefully Heroes of Newerth as well, though I haven't tested that.
> 
>     NOTE: This is candidate for the 7.11 branch.
> 
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40324
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=41096
>     Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
>     Reviewed-and-tested-by: Eric Anholt <eric@anholt.net>
> 
> I'm planning on cherry-picking it to the 7.11 branch in the next few
> days, so it ought to make the upcoming 7.11.1 release.

It's good that we have so many ways and opportunities to test our GPU
reset reliability.

Gordon, can you make sure our regular QA covers GPU hang detect and
reset using a few different methods (e.g. the ones above but without
the fix from Ken in Mesa)?  It's important that reset work really well
and ideally w/o even being noticed by the user, so the more ways we
have to wedge things, the better we can test the reset path's
invisibility.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-25  7:15         ` Jesse Barnes
@ 2011-10-25  7:49           ` Daniel Vetter
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2011-10-25  7:49 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: intel-gfx

On Tue, Oct 25, 2011 at 09:15:58AM +0200, Jesse Barnes wrote:
> On Mon, 24 Oct 2011 19:43:44 -0700
> Kenneth Graunke <kenneth@whitecape.org> wrote:
> 
> > On 10/24/2011 05:58 PM, James R. Leu wrote:
> > > Debug output attached
> > 
> > You're in luck!  I fixed this GPU hang today in Mesa master.
> > 
> > This commit fixes the hang:
> > 
> > commit 3cc0a7be23ab603ed40d602595f673a44e079885
> > Author: Kenneth Graunke <kenneth@whitecape.org>
> > Date:   Fri Oct 21 01:03:37 2011 -0700
> > 
> >     i965: Apply post-sync non-zero workaround to homebrew workaround.
> > 
> >     In commit 3e5d3626, Eric added a homebrew workaround to fix GPU
> > hangs in the Mesa "engine" demo and oglc's api-texcoord test.
> > 
> >     Unfortunately, his PIPE_CONTROL contains a Depth Stall, which
> >     necessitates the post-sync non-zero workaround,
> > 
> >     Fixes GPU hangs in Civilization 4, PlaneShift, and 3DMMES.
> >     Hopefully Heroes of Newerth as well, though I haven't tested that.
> > 
> >     NOTE: This is candidate for the 7.11 branch.
> > 
> >     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40324
> >     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=41096
> >     Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> >     Reviewed-and-tested-by: Eric Anholt <eric@anholt.net>
> > 
> > I'm planning on cherry-picking it to the 7.11 branch in the next few
> > days, so it ought to make the upcoming 7.11.1 release.
> 
> It's good that we have so many ways and opportunities to test our GPU
> reset reliability.
> 
> Gordon, can you make sure our regular QA covers GPU hang detect and
> reset using a few different methods (e.g. the ones above but without
> the fix from Ken in Mesa)?  It's important that reset work really well
> and ideally w/o even being noticed by the user, so the more ways we
> have to wedge things, the better we can test the reset path's
> invisibility.

I'm thinking about adding a debugfs file that stops ringbuffer tail writes
on the specified ring to simulate a gpu hang. This way we can really
stress-test the hangcheck and error_state capture code. And by throwing
random workloads at the gpu while we "hang" it we hopefully can decently
exercise the gpu reset code and see whether it properly resets the gpu (or
just takes down the entire system).
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
@ 2011-10-28 14:12 Nicolas Kalkhof
  2011-10-28 14:45 ` Bojan Smojver
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolas Kalkhof @ 2011-10-28 14:12 UTC (permalink / raw)
  To: jleu, intel-gfx

Hi,

This looks like a known issue with mobile SNB (Sandy Bridge) chips when rc6 is enabled. Please try disabling rc6 with i915.i915_enable_rc6=0 on your kernel command line.
This should take care of the wakeup hangs, but it also turns off GPU power saving, drawing approximately 10 watts more from the system.
If disabling rc6 makes your system more stable, try the latest kernel drm-next branch from Dave Airlie's git tree:
git://people.freedesktop.org/~airlied/linux.git  branch: drm-core-next
and use the following kernel command line parameters:
intel_iommu=off pcie_aspm=force i915.i915_enable_rc6=1 i915.i915_enable_fbc=1 i915.lvds_downclock=1
I have no idea whether all of these parameters are effective, but this works for me on my Lenovo T420 with an i7 2620M.
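
To check which of the suggested parameters are actually in effect on a running system, something like the following sketch may help. The helper names parse_cmdline and missing_params are made up for illustration, and the parser is deliberately naive: it splits on whitespace and does not handle quoted values containing spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Bare flags without '=' (e.g. 'quiet') map to None.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(current, wanted):
    """Return the wanted parameters that are absent or set differently."""
    cur = parse_cmdline(current)
    return {
        name: value
        for name, value in parse_cmdline(wanted).items()
        if name not in cur or cur[name] != value
    }


# Parameter string taken from the message above.
SUGGESTED = ("intel_iommu=off pcie_aspm=force i915.i915_enable_rc6=1 "
             "i915.i915_enable_fbc=1 i915.lvds_downclock=1")

# Example command line; this value is invented for the demonstration.
example = "ro root=/dev/sda1 quiet i915.i915_enable_rc6=0 intel_iommu=off"
print(missing_params(example, SUGGESTED))
```

On a live system the current command line can be read from /proc/cmdline and passed in as the first argument in place of the invented example string.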

Regards,
Nic




-----Original Message-----
From: "James R. Leu" <jleu@mindspring.com>
Sent: Oct 24, 2011 6:12:21 AM
To: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup

>Hello,
>
>I'm running wow in wine on 64 bit fedora rawhide on a dell vostro 3550
>(i5 with integrated GPU).
>
>I'm reliably able to produce 2 types of crashes:
>- wow freezes, but I can get to text console, in this case I'm able to
> grab a kernel stack trace (below) prior to seeing the normal
> [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 452684 at 452608, next 452686)
>- the other is a complete freeze of the system, hard reset required, nothing logged to /var/log/messages
>
>Is there any value in me creating a bug report for this, it seems to be a pretty common issue.
>Is there any use in my trying different kernel command line optios for the i915 driver
>or config options to the xorg intel driver?
>
>I have the various git trees pulled out (I was looking for recent changes that might be related
>to this issue). I'm capable of building and installing from these git trees if there are specific
>bits that I should test.
>
>Oct 22 20:52:59 localhost kernel: [ 939.830806] ------------[ cut here ]------------
>Oct 22 20:52:59 localhost kernel: [ 939.830814] WARNING: at drivers/gpu/drm/i915/i915_drv.c:372 gen6_gt_force_wake_put+0x29/0x51 [i915]()
>Oct 22 20:52:59 localhost kernel: [ 939.830816] Hardware name: Vostro 3550
>Oct 22 20:52:59 localhost kernel: [ 939.830818] Modules linked in: snd_seq_dummy fuse ip6table_filter ip6_tables ebtable_nat ebtables xt_state xt_CHECKSUM iptable_mangle ppdev parport_pc lp parport vboxpci vboxnetadp vboxnetflt vboxdrv bridge stp llc tun rfcomm bnep ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_codec_idt uvcvideo videodev btusb media bluetooth v4l2_compat_ioctl32 arc4 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm iwlagn microcode mac80211 dell_laptop iTCO_wdt r8169 i2c_i801 snd_timer cfg80211 snd mii iTCO_vendor_support dcdbas dell_wmi sparse_keymap soundcore rfkill snd_page_alloc virtio_net kvm_intel kvm binfmt_misc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
>Oct 22 20:52:59 localhost kernel: [ 939.830926] Pid: 0, comm: swapper Tainted: G WC 3.1.0-0.rc10.git0.1.fc17.x86_64 #1
>Oct 22 20:52:59 localhost kernel: [ 939.830928] Call Trace:
>Oct 22 20:52:59 localhost kernel: [ 939.830930] <IRQ [<ffffffff8105c3a0>] warn_slowpath_common+0x83/0x9b
>Oct 22 20:52:59 localhost kernel: [ 939.830941] [<ffffffff8105c3d2>] warn_slowpath_null+0x1a/0x1c
>Oct 22 20:52:59 localhost kernel: [ 939.830952] [<ffffffffa006b624>] gen6_gt_force_wake_put+0x29/0x51 [i915]
>Oct 22 20:52:59 localhost kernel: [ 939.830963] [<ffffffffa006f45f>] i915_read32+0x44/0x6b [i915]
>Oct 22 20:52:59 localhost kernel: [ 939.830975] [<ffffffffa00724a9>] i915_hangcheck_elapsed+0xe8/0x1f8 [i915]
>Oct 22 20:52:59 localhost kernel: [ 939.831027] [<ffffffff81062ddd>] irq_exit+0x5d/0xcf
>Oct 22 20:52:59 localhost kernel: [ 939.831032] [<ffffffff8150de91>] smp_apic_timer_interrupt+0x7c/0x8a
>Oct 22 20:52:59 localhost kernel: [ 939.831036] [<ffffffff8150bd73>] apic_timer_interrupt+0x73/0x80
>Oct 22 20:52:59 localhost kernel: [ 939.831038] <EOI [<ffffffff81014ded>] ? paravirt_read_tsc+0x9/0xd
>Oct 22 20:52:59 localhost kernel: [ 939.831046] [<ffffffff81297075>] ? intel_idle+0xe5/0x10c
>Oct 22 20:52:59 localhost kernel: [ 939.831050] [<ffffffff81297071>] ? intel_idle+0xe1/0x10c
>Oct 22 20:52:59 localhost kernel: [ 939.831054] [<ffffffff813e14fe>] cpuidle_idle_call+0x11c/0x1fe
>Oct 22 20:52:59 localhost kernel: [ 939.831059] [<ffffffff8100e2ef>] cpu_idle+0xab/0x101
>Oct 22 20:52:59 localhost kernel: [ 939.831063] [<ffffffff814df673>] rest_init+0xd7/0xde
>Oct 22 20:52:59 localhost kernel: [ 939.831067] [<ffffffff814df59c>] ? csum_partial_copy_generic+0x16c/0x16c
>Oct 22 20:52:59 localhost kernel: [ 939.831072] [<ffffffff81d53bb0>] start_kernel+0x3dd/0x3ea
>Oct 22 20:52:59 localhost kernel: [ 939.831076] [<ffffffff81d532c4>] x86_64_start_reservations+0xaf/0xb3
>Oct 22 20:52:59 localhost kernel: [ 939.831081] [<ffffffff81d53140>] ? early_idt_handlers+0x140/0x140
>Oct 22 20:52:59 localhost kernel: [ 939.831085] [<ffffffff81d533ca>] x86_64_start_kernel+0x102/0x111
>Oct 22 20:52:59 localhost kernel: [ 939.831088] ---[ end trace f5cba358bac6b7e5 ]---
>
>--
>James R. Leu
>jleu@mindspring.com



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-28 14:12 Nicolas Kalkhof
@ 2011-10-28 14:45 ` Bojan Smojver
  2011-11-01  1:42   ` James R. Leu
  0 siblings, 1 reply; 12+ messages in thread
From: Bojan Smojver @ 2011-10-28 14:45 UTC (permalink / raw)
  To: nkalkhof; +Cc: intel-gfx

------- Original message -------
> From: Nicolas Kalkhof

> No Idea if all of these params are effective but this works for me on my 
> lenovo t420 with a i7 2620M.

Out of curiosity, unrelated to this problem and because you have similar 
hardware to mine - do repeated hibernate/thaw cycles cause the kernel on 
your system to start throwing all sorts of random errors, due to memory 
corruption?

--
Bojan 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
@ 2011-10-28 14:47 nkalkhof
  0 siblings, 0 replies; 12+ messages in thread
From: nkalkhof @ 2011-10-28 14:47 UTC (permalink / raw)
  To: Bojan Smojver; +Cc: intel-gfx

Hi,

Good question. I don't use hibernate, so I can't say anything about that. :-(

Regards
Nic


-----Original Message-----
From: "Bojan Smojver" <bojan@rexursive.com>
Sent: Oct 28, 2011 4:45:31 PM
To: nkalkhof@web.de
Subject: Re: [Intel-gfx] Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup

>------- Original message -------
>> From: Nicolas Kalkhof
>
>> No Idea if all of these params are effective but this works for me on my
>> lenovo t420 with a i7 2620M.
>
>Out of curiosity, unrelated to this problem and because you have similar
>hardware to mine - do repeated hibernate/thaw cycles cause the kernel on
>your system to start throwing all sorts of random errors, due to memory
>corruption?
>
>--
>Bojan



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-10-28 14:45 ` Bojan Smojver
@ 2011-11-01  1:42   ` James R. Leu
  2011-11-01 10:37     ` Eugeni Dodonov
  0 siblings, 1 reply; 12+ messages in thread
From: James R. Leu @ 2011-11-01  1:42 UTC (permalink / raw)
  To: Bojan Smojver; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 701 bytes --]

I do not use hibernate; I have lid close set to suspend.
Up until I upgraded to the 3.2.0 kernel (rawhide), my suspend/wakeup cycles
had been stable.


On Sat, Oct 29, 2011 at 01:45:31AM +1100, Bojan Smojver wrote:
> ------- Original message -------
> >From: Nicolas Kalkhof
> 
> >No Idea if all of these params are effective but this works for me
> >on my lenovo t420 with a i7 2620M.
> 
> Out of curiosity, unrelated to this problem and because you have
> similar hardware to mine - do repeated hibernate/thaw cycles cause
> the kernel on your system to start throwing all sorts of random
> errors, due to memory corruption?
> 
> --
> Bojan

-- 
James R. Leu
jleu@mindspring.com

[-- Attachment #1.2: Type: application/pgp-signature, Size: 198 bytes --]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-11-01  1:42   ` James R. Leu
@ 2011-11-01 10:37     ` Eugeni Dodonov
  2011-11-01 11:05       ` James R. Leu
  0 siblings, 1 reply; 12+ messages in thread
From: Eugeni Dodonov @ 2011-11-01 10:37 UTC (permalink / raw)
  To: jleu; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 399 bytes --]

On Mon, Oct 31, 2011 at 23:42, James R. Leu <jleu@mindspring.com> wrote:

> I do not use hibernate, I have my lid close set to suspend.
> Up until I upgraded to 3.2.0 kernel (rawhide) my suspend/wakup cycles
> had been stable.
>

From what version have you upgraded?

Could you try to bisect it, to find the commit which broke the suspend for
you?
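
For anyone unfamiliar with bisecting, the idea can be sketched as a binary search over an ordered list of commits. This is only an illustration under simplifying assumptions: the history is treated as linear, the first commit is known good, the last is known bad, and every commit after the first bad one is also bad. The function name first_bad and the commit labels are invented. In practice git automates the same search with `git bisect start`, `git bisect good` and `git bisect bad`.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is good, commits[-1] is bad, and badness never
    goes away once it has been introduced.
    """
    lo, hi = 0, len(commits) - 1   # lo: known good, hi: known bad
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1                 # one build-and-test cycle per iteration
        if is_bad(commits[mid]):
            hi = mid               # regression is at mid or earlier
        else:
            lo = mid               # regression is after mid
    return commits[hi], steps


# Invented history of 1000 commits in which suspend breaks at commit 637.
history = ["c%d" % i for i in range(1000)]
culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 637)
print(culprit, tests)
```

The point of the exercise is the cost: even with about a thousand commits between the good and the bad kernel, roughly ten build-and-test cycles are enough to find the offending commit.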

-- 
Eugeni Dodonov
 <http://eugeni.dodonov.net/>

[-- Attachment #1.2: Type: text/html, Size: 696 bytes --]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question about how to troubleshoot sandybridge kernel opps and subsequest GPU lockup
  2011-11-01 10:37     ` Eugeni Dodonov
@ 2011-11-01 11:05       ` James R. Leu
  0 siblings, 0 replies; 12+ messages in thread
From: James R. Leu @ 2011-11-01 11:05 UTC (permalink / raw)
  To: Eugeni Dodonov; +Cc: intel-gfx


[-- Attachment #1.1: Type: text/plain, Size: 685 bytes --]

I went from 3.1.0 to 3.2.0 (Fedora Rawhide RPMs).

I will pull the source RPMs, figure out which git commits they
are based on, and then try a bisect.

On Tue, Nov 01, 2011 at 08:37:42AM -0200, Eugeni Dodonov wrote:
> On Mon, Oct 31, 2011 at 23:42, James R. Leu <jleu@mindspring.com> wrote:
>      I do not use hibernate, I have my lid close set to suspend.
>      Up until I upgraded to 3.2.0 kernel (rawhide) my suspend/wakup cycles
>      had been stable.
> 
> From what version have you upgraded?
> 
> Could you try to bisect it, to find the commit which broke the suspend for you?
> 
> 
> --
> Eugeni Dodonov
> 

-- 
James R. Leu
jleu@mindspring.com

[-- Attachment #1.2: Type: application/pgp-signature, Size: 198 bytes --]


^ permalink raw reply	[flat|nested] 12+ messages in thread
