* Oops at i915_gem_retire_requests_ring
@ 2011-03-17 12:39 Herton Ronaldo Krzesinski
2011-03-17 13:46 ` Chris Wilson
0 siblings, 1 reply; 4+ messages in thread
From: Herton Ronaldo Krzesinski @ 2011-03-17 12:39 UTC (permalink / raw)
To: dri-devel; +Cc: intel-gfx
(please CC me if replying)
Hi,
recently I saw a report about an oops in 2.6.38-rc7 in i915, attached at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/733780
Unfortunately the oops is cut (without call trace):
[126167.230394] BUG: unable to handle kernel paging request at 00100104
[126167.230699] IP: [<f8c2ce44>] i915_gem_retire_requests_ring+0xd4/0x240 [i915]
[126167.231042] *pdpt = 00000000314c1001 *pde = 0000000000000000
[126167.231314] Oops: 0002 [#1] SMP
[126167.231471] last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT1/current_now
[126167.231901] Modules linked in: snd_seq_dummy nls_utf8 isofs btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs cryptd aes_i586 aes_generic binfmt_misc vboxnetadp vboxnetflt vboxdrv parport_pc ppdev snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep arc4 snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq uvcvideo videodev snd_timer snd_seq_device joydev iwlagn iwlcore mac80211 snd cfg80211 soundcore i915 drm_kms_helper snd_page_alloc psmouse drm serio_raw i2c_algo_bit video lp parport usbhid hid sky2 sdhci_pci ahci sdhci libahci
[126167.232018]
[126167.232018] Pid: 1101, comm: Xorg Not tainted 2.6.38-6-generic-pae #34-Ubuntu Gateway MC7833U /
[126167.232018] EIP: 0060:[<f8c2ce44>] EFLAGS: 00213246 CPU: 0
[126167.232018] EIP is at i915_gem_retire_requests_ring+0xd4/0x240 [i915]
[126167.232018] EAX: 00200200 EBX: f1ac25b0 ECX: 00000040 EDX: 00100100
[126167.232018] ESI: f1a2801c EDI: e87fc060 EBP: ef4d7dd8 ESP: ef4d7db0
[126167.232018] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[126167.232018] Process Xorg (pid: 1101, ti=ef4d6000 task=f1ba6500 task.ti=ef4d6000)
[126167.232018] Stack:
[126167.232018] f1a28000 f1a2809c f1a28094 0058bd97 f1aa2400 f1a2801c 0058bd7b 0058bd85
[126167.232018] f1a2801c f1a28000 ef4d7e38 f8c2e995 ef4d7e30 ef4d7e60 c14d1ebc f6b3a040
[126167.232018] f1522cc0 000000db 00000000 f1ba6500 ffffffa1 00000000 00000001 f1a29214
[126167.232018] Call Trace:
But we already can see some things. After downloading the debug symbols
for kernel above [1] and extracting the files (dpkg-deb -x), setting up
the source files, in gdb we can get this:
(gdb) l *(i915_gem_retire_requests_ring+0xd4)
0x17e74 is in i915_gem_retire_requests_ring (/build/buildd/linux-2.6.38/include/linux/list.h:88).
warning: Source file is more recent than executable.
83 * This is only for internal list manipulation where we know
84 * the prev/next entries already!
85 */
86 static inline void __list_del(struct list_head * prev, struct list_head * next)
87 {
88 next->prev = prev;
89 prev->next = next;
90 }
91
92 /**
So the oops is in a list_del call. And looking at the EDX and
dereferenced value, it's LIST_POISON1 + 4, as
p &((struct list_head *)0)->prev == 0x4
So indeed the oops is happening on __list_del at next->prev = prev,
called from i915_gem_retire_requests_ring, probably from the
i915_gem_request_remove_from_client call, going back in the code for
example:
(gdb) l *(i915_gem_retire_requests_ring+0xc6)
0x17e66 is in i915_gem_retire_requests_ring (/build/buildd/linux-2.6.38/drivers/gpu/drm/i915/i915_gem.c:1748).
warning: Source file is more recent than executable.
1743 static inline void
1744 i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
1745 {
1746 struct drm_i915_file_private *file_priv = request->file_priv;
1747
1748 if (!file_priv)
1749 return;
1750
1751 spin_lock(&file_priv->mm.lock);
1752 list_del(&request->client_list);
But without the call trace no way to know for certain. Anyway the oops
with LIST_POISON1 means we are trying to delete from the list an already
deleted item. As I can see in the code (I'm not used to that code, so
may be there is something I don't see), it seems we indeed have the
possibility to remove an request->client_list twice, which would cause
the above, we do list_del(&request->client_list) on both
i915_gem_request_remove_from_client and i915_gem_release
I don't know if it's the most correct fix, but perhaps the simple fix
is needed in the code. It's against latest Linus tree. We may have an
already removed client_list, or we didn't add any item to client_list
(file_priv NULL in i915_add_request)
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
---
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 36e66cc..6077c0d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1749,8 +1749,10 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
return;
spin_lock(&file_priv->mm.lock);
- list_del(&request->client_list);
- request->file_priv = NULL;
+ if (request->file_priv) {
+ list_del(&request->client_list);
+ request->file_priv = NULL;
+ }
spin_unlock(&file_priv->mm.lock);
}
@@ -4043,8 +4045,10 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
request = list_first_entry(&file_priv->mm.request_list,
struct drm_i915_gem_request,
client_list);
- list_del(&request->client_list);
- request->file_priv = NULL;
+ if (request->file_priv) {
+ list_del(&request->client_list);
+ request->file_priv = NULL;
+ }
}
spin_unlock(&file_priv->mm.lock);
}
[1] http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-2.6.38-6-generic-pae-dbgsym_2.6.38-6.34_i386.ddeb
--
[]'s
Herton
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: Oops at i915_gem_retire_requests_ring
2011-03-17 12:39 Oops at i915_gem_retire_requests_ring Herton Ronaldo Krzesinski
@ 2011-03-17 13:46 ` Chris Wilson
2011-03-17 16:54 ` Herton Ronaldo Krzesinski
0 siblings, 1 reply; 4+ messages in thread
From: Chris Wilson @ 2011-03-17 13:46 UTC (permalink / raw)
To: Herton Ronaldo Krzesinski, dri-devel; +Cc: intel-gfx
On Thu, 17 Mar 2011 09:39:07 -0300, Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> wrote:
> I don't know if it's the most correct fix, but perhaps the simple fix
> is needed in the code. It's against latest Linus tree. We may have an
> already removed client_list, or we didn't add any item to client_list
> (file_priv NULL in i915_add_request)
>
> Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
> ---
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 36e66cc..6077c0d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1749,8 +1749,10 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
> return;
>
> spin_lock(&file_priv->mm.lock);
> - list_del(&request->client_list);
> - request->file_priv = NULL;
> + if (request->file_priv) {
> + list_del(&request->client_list);
> + request->file_priv = NULL;
> + }
> spin_unlock(&file_priv->mm.lock);
> }
This is the single chunk required. I had thought that the actual
insertion/deletion was serialised under the struct mutex and the intention
of the spinlock was to protect the unlocked list traversal during
throttling. However, I missed that i915_gem_release() is also called
without struct mutex and so we do need the double check for
i915_gem_request_remove_from_client().
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Oops at i915_gem_retire_requests_ring
2011-03-17 13:46 ` Chris Wilson
@ 2011-03-17 16:54 ` Herton Ronaldo Krzesinski
2011-03-17 17:06 ` Chris Wilson
0 siblings, 1 reply; 4+ messages in thread
From: Herton Ronaldo Krzesinski @ 2011-03-17 16:54 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx, dri-devel
On Thu, Mar 17, 2011 at 01:46:34PM +0000, Chris Wilson wrote:
> On Thu, 17 Mar 2011 09:39:07 -0300, Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> wrote:
> > I don't know if it's the most correct fix, but perhaps the simple fix
> > is needed in the code. It's against latest Linus tree. We may have an
> > already removed client_list, or we didn't add any item to client_list
> > (file_priv NULL in i915_add_request)
> >
> > Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
> > ---
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 36e66cc..6077c0d 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -1749,8 +1749,10 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
> > return;
> >
> > spin_lock(&file_priv->mm.lock);
> > - list_del(&request->client_list);
> > - request->file_priv = NULL;
> > + if (request->file_priv) {
> > + list_del(&request->client_list);
> > + request->file_priv = NULL;
> > + }
> > spin_unlock(&file_priv->mm.lock);
> > }
>
> This is the single chunk required. I had thought that the actual
> insertion/deletion was serialised under the struct mutex and the intention
> of the spinlock was to protect the unlocked list traversal during
> throttling. However, I missed that i915_gem_release() is also called
> without struct mutex and so we do need the double check for
> i915_gem_request_remove_from_client().
Ok. I just still have one doubt though, if in i915_add_request
file/file_priv is NULL, wouldn't be possible to have an oops also in
i915_gem_release without the check? As in this case,
request->client_list wouldn't have mm.request_list added to it, and if
an error occurs and i915_reset is called, which ends up calling
i915_gem_release, we would try to do a list_del on request->client_list
without items.
If the check really isn't needed in i915_gem_release, then please
consider this patch:
From: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Avoid oops in i915_gem_request_remove_from_client client_list removal
When i915_gem_retire_requests_ring calls i915_gem_request_remove_from_client,
the client_list for that request may already be removed in i915_gem_release.
So we may call twice list_del(&request->client_list), resulting in an
oops like this report:
[126167.230394] BUG: unable to handle kernel paging request at 00100104
[126167.230699] IP: [<f8c2ce44>] i915_gem_retire_requests_ring+0xd4/0x240 [i915]
[126167.231042] *pdpt = 00000000314c1001 *pde = 0000000000000000
[126167.231314] Oops: 0002 [#1] SMP
[126167.231471] last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT1/current_now
[126167.231901] Modules linked in: snd_seq_dummy nls_utf8 isofs btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs cryptd aes_i586 aes_generic binfmt_misc vboxnetadp vboxnetflt vboxdrv parport_pc ppdev snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep arc4 snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq uvcvideo videodev snd_timer snd_seq_device joydev iwlagn iwlcore mac80211 snd cfg80211 soundcore i915 drm_kms_helper snd_page_alloc psmouse drm serio_raw i2c_algo_bit video lp parport usbhid hid sky2 sdhci_pci ahci sdhci libahci
[126167.232018]
[126167.232018] Pid: 1101, comm: Xorg Not tainted 2.6.38-6-generic-pae #34-Ubuntu Gateway MC7833U /
[126167.232018] EIP: 0060:[<f8c2ce44>] EFLAGS: 00213246 CPU: 0
[126167.232018] EIP is at i915_gem_retire_requests_ring+0xd4/0x240 [i915]
[126167.232018] EAX: 00200200 EBX: f1ac25b0 ECX: 00000040 EDX: 00100100
[126167.232018] ESI: f1a2801c EDI: e87fc060 EBP: ef4d7dd8 ESP: ef4d7db0
[126167.232018] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[126167.232018] Process Xorg (pid: 1101, ti=ef4d6000 task=f1ba6500 task.ti=ef4d6000)
[126167.232018] Stack:
[126167.232018] f1a28000 f1a2809c f1a28094 0058bd97 f1aa2400 f1a2801c 0058bd7b 0058bd85
[126167.232018] f1a2801c f1a28000 ef4d7e38 f8c2e995 ef4d7e30 ef4d7e60 c14d1ebc f6b3a040
[126167.232018] f1522cc0 000000db 00000000 f1ba6500 ffffffa1 00000000 00000001 f1a29214
[126167.232018] Call Trace:
Unfortunately the call trace reported was cut, but looking at debug
symbols the crash is at __list_del, when probably list_del is called
twice on the same request->client_list, as the dereferenced value is
LIST_POISON1 + 4, and by looking more at the debug symbols before
list_del call it should have being called by
i915_gem_request_remove_from_client
And as I can see in the code, it seems we indeed have the possibility
to remove a request->client_list twice, which would cause the above,
because we do list_del(&request->client_list) on both
i915_gem_request_remove_from_client and i915_gem_release
As Chris Wilson pointed out, it's indeed the case:
"(...) I had thought that the actual insertion/deletion was serialised
under the struct mutex and the intention of the spinlock was to protect
the unlocked list traversal during throttling. However, I missed that
i915_gem_release() is also called without struct mutex and so we do need
the double check for i915_gem_request_remove_from_client()."
This change does the required check to see if we really need to remove
request->client_list.
BugLink: http://bugs.launchpad.net/bugs/733780
Cc: stable@kernel.org
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
---
drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 950a5ab..fdf9166 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1753,8 +1753,10 @@ i915_gem_request_remove_from_client(struct drm_i915_gem_request *request)
return;
spin_lock(&file_priv->mm.lock);
- list_del(&request->client_list);
- request->file_priv = NULL;
+ if (request->file_priv) {
+ list_del(&request->client_list);
+ request->file_priv = NULL;
+ }
spin_unlock(&file_priv->mm.lock);
}
--
1.7.1
> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
>
--
[]'s
Herton
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: Oops at i915_gem_retire_requests_ring
2011-03-17 16:54 ` Herton Ronaldo Krzesinski
@ 2011-03-17 17:06 ` Chris Wilson
0 siblings, 0 replies; 4+ messages in thread
From: Chris Wilson @ 2011-03-17 17:06 UTC (permalink / raw)
To: Herton Ronaldo Krzesinski; +Cc: intel-gfx, dri-devel
On Thu, 17 Mar 2011 13:54:45 -0300, Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> wrote:
> On Thu, Mar 17, 2011 at 01:46:34PM +0000, Chris Wilson wrote:
> > This is the single chunk required. I had thought that the actual
> > insertion/deletion was serialised under the struct mutex and the intention
> > of the spinlock was to protect the unlocked list traversal during
> > throttling. However, I missed that i915_gem_release() is also called
> > without struct mutex and so we do need the double check for
> > i915_gem_request_remove_from_client().
>
> Ok. I just still have one doubt though, if in i915_add_request
> file/file_priv is NULL, wouldn't be possible to have an oops also in
> i915_gem_release without the check? As in this case,
> request->client_list wouldn't have mm.request_list added to it, and if
> an error occurs and i915_reset is called, which ends up calling
> i915_gem_release, we would try to do a list_del on request->client_list
> without items.
If the file_priv is NULL, then the request is not added to the client
mm.request_list and so it is not seen during i915_gem_release.
The list is file_priv->mm.request_list, the nodes within that are
request->client_list.
> If the check really isn't needed in i915_gem_release, then please
> consider this patch:
Done, thanks,
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2011-03-17 17:06 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-17 12:39 Oops at i915_gem_retire_requests_ring Herton Ronaldo Krzesinski
2011-03-17 13:46 ` Chris Wilson
2011-03-17 16:54 ` Herton Ronaldo Krzesinski
2011-03-17 17:06 ` Chris Wilson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox