* [PATCH] tests/gem_reset_stats: add close-pending-fork
@ 2013-12-02 14:47 Mika Kuoppala
2013-12-02 15:03 ` Chris Wilson
2013-12-02 15:31 ` [RFC] drm/i915: reference count batch object on requests Mika Kuoppala
0 siblings, 2 replies; 12+ messages in thread
From: Mika Kuoppala @ 2013-12-02 14:47 UTC (permalink / raw)
To: intel-gfx
Fork and create another filedesc to wait access to already
hung GPU and then kill it. This triggers use after free of the
request->batch_obj.
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
tests/gem_reset_stats.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
index 6c22bce..242300c 100644
--- a/tests/gem_reset_stats.c
+++ b/tests/gem_reset_stats.c
@@ -25,6 +25,7 @@
*
*/
+#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
@@ -36,6 +37,7 @@
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <time.h>
+#include <signal.h>
#include "i915_drm.h"
#include "intel_bufmgr.h"
@@ -637,6 +639,64 @@ static void test_close_pending(void)
close(fd);
}
+static int __test_open_any(void)
+{
+ char *name;
+ int ret, fd;
+
+ ret = asprintf(&name, "/dev/dri/card%d", drm_get_card());
+ if (ret == -1)
+ return -1;
+
+ fd = open(name, O_RDWR);
+ free(name);
+
+ return fd;
+}
+
+static void test_close_pending_fork(void)
+{
+ int pid;
+ int fd, fd2, h;
+
+ fd = drm_open_any();
+ igt_assert(fd >= 0);
+
+ assert_reset_status(fd, 0, RS_NO_ERROR);
+
+ igt_disable_exit_handler();
+
+ h = inject_hang(fd, 0);
+ igt_assert(h >= 0);
+
+ sleep(1);
+
+ pid = fork();
+ if (pid == 0) {
+ fd2 = __test_open_any();
+ igt_assert(fd2 >= 0);
+ gem_quiescent_gpu(fd2);
+ close(fd2);
+ } else {
+ igt_assert(pid > 0);
+ sleep(1);
+ kill(pid, SIGKILL);
+ }
+
+ gem_close(fd, h);
+ close(fd);
+
+ fd = drm_open_any();
+ igt_assert(fd >= 0);
+
+ h = exec_valid(fd, 0);
+ igt_assert(h >= 0);
+
+ gem_sync(fd, h);
+ gem_close(fd, h);
+ close(fd);
+}
+
static void drop_root(void)
{
igt_assert(getuid() == 0);
@@ -837,6 +897,9 @@ igt_main
igt_subtest("close-pending")
test_close_pending();
+ igt_subtest("close-pending-fork")
+ test_close_pending_fork();
+
igt_subtest("params")
test_params();
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] tests/gem_reset_stats: add close-pending-fork
2013-12-02 14:47 [PATCH] tests/gem_reset_stats: add close-pending-fork Mika Kuoppala
@ 2013-12-02 15:03 ` Chris Wilson
2013-12-02 16:32 ` Mika Kuoppala
2013-12-02 15:31 ` [RFC] drm/i915: reference count batch object on requests Mika Kuoppala
1 sibling, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-12-02 15:03 UTC (permalink / raw)
To: Mika Kuoppala; +Cc: intel-gfx
On Mon, Dec 02, 2013 at 04:47:46PM +0200, Mika Kuoppala wrote:
> Fork and create another filedesc to wait access to already
> hung GPU and then kill it. This triggers use after free of the
> request->batch_obj.
>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
> +static int __test_open_any(void)
> +{
> + char *name;
> + int ret, fd;
> +
> + ret = asprintf(&name, "/dev/dri/card%d", drm_get_card());
> + if (ret == -1)
> + return -1;
> +
> + fd = open(name, O_RDWR);
> + free(name);
> +
> + return fd;
> +}
> +
> +static void test_close_pending_fork(void)
> +{
> + int pid;
> + int fd, fd2, h;
> +
> + fd = drm_open_any();
> + igt_assert(fd >= 0);
[snip]
> + close(fd);
> +
> + fd = drm_open_any();
This breaks the testsuite since drm_open_any() will now return the
closed fd. Hence your __test_open_any() above.
However this does not justify kernel doing anything other than terminating
your process with extreme prejudice.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC] drm/i915: reference count batch object on requests
2013-12-02 14:47 [PATCH] tests/gem_reset_stats: add close-pending-fork Mika Kuoppala
2013-12-02 15:03 ` Chris Wilson
@ 2013-12-02 15:31 ` Mika Kuoppala
2013-12-03 17:10 ` Daniel Vetter
1 sibling, 1 reply; 12+ messages in thread
From: Mika Kuoppala @ 2013-12-02 15:31 UTC (permalink / raw)
To: intel-gfx; +Cc: miku
We used to lean on active_list to handle the references
to batch objects. But there are useful cases when same,
albeit simple, batch can be executing on multiple rings
concurrently. For this case the active_list reference count
handling is just not enough as batch could be freed by
ring A request retirement as it is still running on ring B.
Fix this by doing proper batch_obj reference counting.
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Notes:
This is a patch which ameliorates the
[PATCH] tests/gem_reset_stats: add close-pending-fork
Chris wasn't happy about the refcounting as it might hide
the true problem. But I haven't been able to find the real culprit,
thus the RFC.
---
drivers/gpu/drm/i915/i915_gem.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 40d9dcf..858538f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2145,13 +2145,12 @@ int __i915_add_request(struct intel_ring_buffer *ring,
request->head = request_start;
request->tail = request_ring_position;
- /* Whilst this request exists, batch_obj will be on the
- * active_list, and so will hold the active reference. Only when this
- * request is retired will the the batch_obj be moved onto the
- * inactive_list and lose its active reference. Hence we do not need
- * to explicitly hold another reference here.
+ /* Active list has one reference but that is not enough as same
+ * batch_obj can be active on multiple rings
*/
request->batch_obj = obj;
+ if (request->batch_obj)
+ drm_gem_object_reference(&request->batch_obj->base);
/* Hold a reference to the current context so that we can inspect
* it later in case a hangcheck error event fires.
@@ -2340,6 +2339,9 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
if (request->ctx)
i915_gem_context_unreference(request->ctx);
+ if (request->batch_obj)
+ drm_gem_object_unreference(&request->batch_obj->base);
+
kfree(request);
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] tests/gem_reset_stats: add close-pending-fork
2013-12-02 15:03 ` Chris Wilson
@ 2013-12-02 16:32 ` Mika Kuoppala
2013-12-03 17:03 ` Daniel Vetter
0 siblings, 1 reply; 12+ messages in thread
From: Mika Kuoppala @ 2013-12-02 16:32 UTC (permalink / raw)
To: Chris Wilson; +Cc: intel-gfx
Chris Wilson <chris@chris-wilson.co.uk> writes:
> On Mon, Dec 02, 2013 at 04:47:46PM +0200, Mika Kuoppala wrote:
>> Fork and create another filedesc to wait access to already
>> hung GPU and then kill it. This triggers use after free of the
>> request->batch_obj.
>>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>> +static int __test_open_any(void)
>> +{
>> + char *name;
>> + int ret, fd;
>> +
>> + ret = asprintf(&name, "/dev/dri/card%d", drm_get_card());
>> + if (ret == -1)
>> + return -1;
>> +
>> + fd = open(name, O_RDWR);
>> + free(name);
>> +
>> + return fd;
>> +}
>> +
>> +static void test_close_pending_fork(void)
>> +{
>> + int pid;
>> + int fd, fd2, h;
>> +
>> + fd = drm_open_any();
>> + igt_assert(fd >= 0);
>
> [snip]
>
>> + close(fd);
>> +
>> + fd = drm_open_any();
>
> This breaks the testsuite since drm_open_any() will now return the
> closed fd. Hence your __test_open_any() above.
>
> However this does not justify kernel doing anything other than terminating
> your process with extreme prejudice.
I discussed this with Chris in IRC. There seems to be no problem
as drm_open_any doesn't cache file descriptors.
The reason i use __test_open_any is that I needed a way to bypass
testsuites exit handling.
--Mika
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] tests/gem_reset_stats: add close-pending-fork
2013-12-02 16:32 ` Mika Kuoppala
@ 2013-12-03 17:03 ` Daniel Vetter
2013-12-04 14:39 ` Mika Kuoppala
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-12-03 17:03 UTC (permalink / raw)
To: Mika Kuoppala; +Cc: intel-gfx
On Mon, Dec 02, 2013 at 06:32:51PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > On Mon, Dec 02, 2013 at 04:47:46PM +0200, Mika Kuoppala wrote:
> >> Fork and create another filedesc to wait access to already
> >> hung GPU and then kill it. This triggers use after free of the
> >> request->batch_obj.
> >>
> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >> ---
> >> +static int __test_open_any(void)
> >> +{
> >> + char *name;
> >> + int ret, fd;
> >> +
> >> + ret = asprintf(&name, "/dev/dri/card%d", drm_get_card());
> >> + if (ret == -1)
> >> + return -1;
> >> +
> >> + fd = open(name, O_RDWR);
> >> + free(name);
> >> +
> >> + return fd;
> >> +}
> >> +
> >> +static void test_close_pending_fork(void)
> >> +{
> >> + int pid;
> >> + int fd, fd2, h;
> >> +
> >> + fd = drm_open_any();
> >> + igt_assert(fd >= 0);
> >
> > [snip]
> >
> >> + close(fd);
> >> +
> >> + fd = drm_open_any();
> >
> > This breaks the testsuite since drm_open_any() will now return the
> > closed fd. Hence your __test_open_any() above.
> >
> > However this does not justify kernel doing anything other than terminating
> > your process with extreme prejudice.
>
> I discussed this with Chris in IRC. There seems to be no problem
> as drm_open_any doesn't cache file descriptors.
>
> The reason i use __test_open_any is that I needed a way to bypass
> testsuites exit handling.
I guess a quick comment to explain that in the code would help. Can you
please update the patch?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] drm/i915: reference count batch object on requests
2013-12-02 15:31 ` [RFC] drm/i915: reference count batch object on requests Mika Kuoppala
@ 2013-12-03 17:10 ` Daniel Vetter
2013-12-04 11:24 ` Chris Wilson
2013-12-04 13:28 ` Mika Kuoppala
0 siblings, 2 replies; 12+ messages in thread
From: Daniel Vetter @ 2013-12-03 17:10 UTC (permalink / raw)
To: Mika Kuoppala; +Cc: intel-gfx, miku
On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
> We used to lean on active_list to handle the references
> to batch objects. But there are useful cases when same,
> albeit simple, batch can be executing on multiple rings
> concurrently. For this case the active_list reference count
> handling is just not enough as batch could be freed by
> ring A request retirement as it is still running on ring B.
>
> Fix this by doing proper batch_obj reference counting.
>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>
> Notes:
> This is a patch which ameliorates the
> [PATCH] tests/gem_reset_stats: add close-pending-fork
>
> Chris wasn't happy about the refcounting as it might hide
> the true problem. But I haven't been able to find the real culprit,
> thus the RFC.
I think I understand the bug now, and your patch looks like the correct
fix. But we need to pimp the commit message.
In i915_gem_reset_ring_lists we reset requests and move objects to the
inactive list. Which means if the active list is the last one to hold a
reference, the object will disappear.
Now the problem is that we do this per-ring, and not in the order that the
objects would have been retired if the gpu wouldn't have hung. E.g. if a
batch is active on both ring 1&2 but was last active on ring 1, then we'd
free it before we go ahead with cleaning up the requests for ring 2.
So reference-counting the batch_obj pointers looks like the real fix.
Can you please amend your patch with this explanation and also please add
a backtrace that shows the crash? Might need to gather it with netconsole.
Thanks, Daniel
> ---
> drivers/gpu/drm/i915/i915_gem.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 40d9dcf..858538f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2145,13 +2145,12 @@ int __i915_add_request(struct intel_ring_buffer *ring,
> request->head = request_start;
> request->tail = request_ring_position;
>
> - /* Whilst this request exists, batch_obj will be on the
> - * active_list, and so will hold the active reference. Only when this
> - * request is retired will the the batch_obj be moved onto the
> - * inactive_list and lose its active reference. Hence we do not need
> - * to explicitly hold another reference here.
> + /* Active list has one reference but that is not enough as same
> + * batch_obj can be active on multiple rings
> */
> request->batch_obj = obj;
> + if (request->batch_obj)
> + drm_gem_object_reference(&request->batch_obj->base);
>
> /* Hold a reference to the current context so that we can inspect
> * it later in case a hangcheck error event fires.
> @@ -2340,6 +2339,9 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
> if (request->ctx)
> i915_gem_context_unreference(request->ctx);
>
> + if (request->batch_obj)
> + drm_gem_object_unreference(&request->batch_obj->base);
> +
> kfree(request);
> }
>
> --
> 1.7.9.5
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] drm/i915: reference count batch object on requests
2013-12-03 17:10 ` Daniel Vetter
@ 2013-12-04 11:24 ` Chris Wilson
2013-12-04 12:08 ` Daniel Vetter
2013-12-04 13:28 ` Mika Kuoppala
1 sibling, 1 reply; 12+ messages in thread
From: Chris Wilson @ 2013-12-04 11:24 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx, miku
On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
> > We used to lean on active_list to handle the references
> > to batch objects. But there are useful cases when same,
> > albeit simple, batch can be executing on multiple rings
> > concurrently. For this case the active_list reference count
> > handling is just not enough as batch could be freed by
> > ring A request retirement as it is still running on ring B.
> >
> > Fix this by doing proper batch_obj reference counting.
> >
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >
> > Notes:
> > This is a patch which ameliorates the
> > [PATCH] tests/gem_reset_stats: add close-pending-fork
> >
> > Chris wasn't happy about the refcounting as it might hide
> > the true problem. But I haven't been able to find the real culprit,
> > thus the RFC.
>
> I think I understand the bug now, and your patch looks like the correct
> fix. But we need to pimp the commit message.
>
> In i915_gem_reset_ring_lists we reset requests and move objects to the
> inactive list. Which means if the active list is the last one to hold a
> reference, the object will disappear.
>
> Now the problem is that we do this per-ring, and not in the order that the
> objects would have been retired if the gpu wouldn't have hung. E.g. if a
> batch is active on both ring 1&2 but was last active on ring 1, then we'd
> free it before we go ahead with cleaning up the requests for ring 2.
>
> So reference-counting the batch_obj pointers looks like the real fix.
No. The bug only exists in i915_gem_reset() and should not impact
anywhere else.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] drm/i915: reference count batch object on requests
2013-12-04 11:24 ` Chris Wilson
@ 2013-12-04 12:08 ` Daniel Vetter
2013-12-04 12:11 ` Chris Wilson
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vetter @ 2013-12-04 12:08 UTC (permalink / raw)
To: Chris Wilson, Daniel Vetter, Mika Kuoppala, intel-gfx, miku
On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
>> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
>> > We used to lean on active_list to handle the references
>> > to batch objects. But there are useful cases when same,
>> > albeit simple, batch can be executing on multiple rings
>> > concurrently. For this case the active_list reference count
>> > handling is just not enough as batch could be freed by
>> > ring A request retirement as it is still running on ring B.
>> >
>> > Fix this by doing proper batch_obj reference counting.
>> >
>> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> >
>> > Notes:
>> > This is a patch which ameliorates the
>> > [PATCH] tests/gem_reset_stats: add close-pending-fork
>> >
>> > Chris wasn't happy about the refcounting as it might hide
>> > the true problem. But I haven't been able to find the real culprit,
>> > thus the RFC.
>>
>> I think I understand the bug now, and your patch looks like the correct
>> fix. But we need to pimp the commit message.
>>
>> In i915_gem_reset_ring_lists we reset requests and move objects to the
>> inactive list. Which means if the active list is the last one to hold a
>> reference, the object will disappear.
>>
>> Now the problem is that we do this per-ring, and not in the order that the
>> objects would have been retired if the gpu wouldn't have hung. E.g. if a
>> batch is active on both ring 1&2 but was last active on ring 1, then we'd
>> free it before we go ahead with cleaning up the requests for ring 2.
>>
>> So reference-counting the batch_obj pointers looks like the real fix.
>
> No. The bug only exists in i915_gem_reset() and should not impact
> anywhere else.
Well fixing the bug in i915_gem_reset would be lots more work and
rather fragile - we'd need to retire requests in the correct order.
That level of fanciness is generally not something I like to see in
less-tested code like the reset paths.
Also simply holding a reference for each pointer is established best
practice - atm we have a big comment explaining why it should work by
holding an implicit reference through the active list. With the
refcounting we'll simply trade that complexity (and the code comment
explaining things) in with another tricky case.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] drm/i915: reference count batch object on requests
2013-12-04 12:08 ` Daniel Vetter
@ 2013-12-04 12:11 ` Chris Wilson
0 siblings, 0 replies; 12+ messages in thread
From: Chris Wilson @ 2013-12-04 12:11 UTC (permalink / raw)
To: Daniel Vetter; +Cc: intel-gfx, miku
On Wed, Dec 04, 2013 at 01:08:56PM +0100, Daniel Vetter wrote:
> On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
> >> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
> >> > We used to lean on active_list to handle the references
> >> > to batch objects. But there are useful cases when same,
> >> > albeit simple, batch can be executing on multiple rings
> >> > concurrently. For this case the active_list reference count
> >> > handling is just not enough as batch could be freed by
> >> > ring A request retirement as it is still running on ring B.
> >> >
> >> > Fix this by doing proper batch_obj reference counting.
> >> >
> >> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >> >
> >> > Notes:
> >> > This is a patch which ameliorates the
> >> > [PATCH] tests/gem_reset_stats: add close-pending-fork
> >> >
> >> > Chris wasn't happy about the refcounting as it might hide
> >> > the true problem. But I haven't been able to find the real culprit,
> >> > thus the RFC.
> >>
> >> I think I understand the bug now, and your patch looks like the correct
> >> fix. But we need to pimp the commit message.
> >>
> >> In i915_gem_reset_ring_lists we reset requests and move objects to the
> >> inactive list. Which means if the active list is the last one to hold a
> >> reference, the object will disappear.
> >>
> >> Now the problem is that we do this per-ring, and not in the order that the
> >> objects would have been retired if the gpu wouldn't have hung. E.g. if a
> >> batch is active on both ring 1&2 but was last active on ring 1, then we'd
> >> free it before we go ahead with cleaning up the requests for ring 2.
> >>
> >> So reference-counting the batch_obj pointers looks like the real fix.
> >
> > No. The bug only exists in i915_gem_reset() and should not impact
> > anywhere else.
>
> Well fixing the bug in i915_gem_reset would be lots more work and
> rather fragile - we'd need to retire requests in the correct order.
> That level of fanciness is generally not something I like to see in
> less-tested code like the reset paths.
Nope, just the operations in the right order.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC] drm/i915: reference count batch object on requests
2013-12-03 17:10 ` Daniel Vetter
2013-12-04 11:24 ` Chris Wilson
@ 2013-12-04 13:28 ` Mika Kuoppala
1 sibling, 0 replies; 12+ messages in thread
From: Mika Kuoppala @ 2013-12-04 13:28 UTC (permalink / raw)
To: intel-gfx
In i915_gem_reset_ring_lists we reset requests and move objects to the
inactive list. Which means if the active list is the last one to hold a
reference, the object will disappear.
Now the problem is that we do this per-ring, and not in the order that the
objects would have been retired if the gpu wouldn't have hung. E.g. if a
batch is active on both ring 1&2 but was last active on ring 1, then we'd
free it before we go ahead with cleaning up the requests for ring 2.
Fixes regression (a possible OOPS following a GPU hang) from
commit aa60c664e6df502578454621c3a9b1f087ff8d25
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Wed Jun 12 15:13:20 2013 +0300
drm/i915: find guilty batch buffer on ring resets
Oops:
BUG: unable to handle kernel paging request at 6b6b6ce3
IP: [<f86124cc>] i915_gem_obj_offset+0xc/0x60 [i915]
*pdpt = 0000000000000000 *pde = 0000000000000000
Oops: 0000 [#1] SMP
CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 3.12.0+ #1274
Hardware name: /DZ77BH-55K, BIOS BHZ7710H.86A.0070.2012.0416.2117 04/16/2012
Workqueue: events i915_error_work_func [i915]
task: f6fd8000 ti: f2e3a000 task.ti: f2e3a000
EIP: 0060:[<f86124cc>] EFLAGS: 00010282 CPU: 0
EIP is at i915_gem_obj_offset+0xc/0x60 [i915]
EAX: ea75d880 EBX: e413497c ECX: 6b6b6b6b EDX: f6069998
ESI: f6069298 EDI: e4134960 EBP: f2e3be2c ESP: f2e3be28
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 80050033 CR2: 6b6b6ce3 CR3: 0195b000 CR4: 001407f0
Stack:
e413497c f2e3be84 f8614d23 f8671864 f867ca2c f8689a2c f8686a49 0093c000
00000000 0093cb60 ea75d880 f6718e10 00fd8000 f6068000 00004208 00004208
00000000 00000002 fffff714 f6069340 f6718e10 00000000 f6068000 f2e3beac
Call Trace:
[<f8614d23>] i915_gem_reset+0xc3/0x2a0 [i915]
[<f85f9ada>] i915_reset+0x4a/0x160 [i915]
[<f86002d1>] i915_error_work_func+0xc1/0x110 [i915]
[<c1058b62>] process_one_work+0x122/0x3e0
[<c1059aa7>] worker_thread+0xf7/0x320
[<c10599b0>] ? manage_workers.isra.18+0x290/0x290
[<c105f3f4>] kthread+0x94/0xa0
[<c157daf7>] ret_from_kernel_thread+0x1b/0x28
[<c105f360>] ? flush_kthread_worker+0xb0/0xb0
Code: 2b 5c ca c8 31 c0 83 c4 10 5b 5e 5f 5d c3 90 b8 f4 ff ff ff eb f0 89 f6 8d bc 27 00 00 00 00 55 89 e5 53 3e 8d 74 26 00 8b 48 08 <8b> 89 78 01 00 00 39 91 c0 1a 00 00 8d 99 98 19 00 00 0f 44 d3
EIP: [<f86124cc>] i915_gem_obj_offset+0xc/0x60 [i915] SS:ESP 0068:f2e3be28
CR2: 000000006b6b6ce3
v2: Better commit message and backtrace (Daniel)
Testcase: igt/gem_reset_stats/close-pending-fork
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
drivers/gpu/drm/i915/i915_gem.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 92149bc..e677e56 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2145,13 +2145,12 @@ int __i915_add_request(struct intel_ring_buffer *ring,
request->head = request_start;
request->tail = request_ring_position;
- /* Whilst this request exists, batch_obj will be on the
- * active_list, and so will hold the active reference. Only when this
- * request is retired will the the batch_obj be moved onto the
- * inactive_list and lose its active reference. Hence we do not need
- * to explicitly hold another reference here.
+ /* Active list has one reference but that is not enough as same
+ * batch_obj can be active on multiple rings
*/
request->batch_obj = obj;
+ if (request->batch_obj)
+ drm_gem_object_reference(&request->batch_obj->base);
/* Hold a reference to the current context so that we can inspect
* it later in case a hangcheck error event fires.
@@ -2340,6 +2339,9 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
if (request->ctx)
i915_gem_context_unreference(request->ctx);
+ if (request->batch_obj)
+ drm_gem_object_unreference(&request->batch_obj->base);
+
kfree(request);
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH] tests/gem_reset_stats: add close-pending-fork
2013-12-03 17:03 ` Daniel Vetter
@ 2013-12-04 14:39 ` Mika Kuoppala
2013-12-04 15:48 ` Daniel Vetter
0 siblings, 1 reply; 12+ messages in thread
From: Mika Kuoppala @ 2013-12-04 14:39 UTC (permalink / raw)
To: intel-gfx
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
This triggers use after free oops on request->batch_obj when
going through the rings and setting reset status on requests,
after a gpu hang.
v2: Streamlined the test and added comments (Daniel)
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
tests/gem_reset_stats.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
index 6c22bce..095b14b 100644
--- a/tests/gem_reset_stats.c
+++ b/tests/gem_reset_stats.c
@@ -25,6 +25,7 @@
*
*/
+#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
@@ -36,6 +37,7 @@
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <time.h>
+#include <signal.h>
#include "i915_drm.h"
#include "intel_bufmgr.h"
@@ -637,6 +639,63 @@ static void test_close_pending(void)
close(fd);
}
+static void test_close_pending_fork(void)
+{
+ int pid;
+ int fd, h;
+
+ fd = drm_open_any();
+ igt_assert(fd >= 0);
+
+ assert_reset_status(fd, 0, RS_NO_ERROR);
+
+ h = inject_hang(fd, 0);
+ igt_assert(h >= 0);
+
+ sleep(1);
+
+ /* Avoid helpers as we need to kill the child
+ * without any extra signal handling on behalf of
+ * lib/drmtest.c
+ */
+ pid = fork();
+ if (pid == 0) {
+ /* Not first drm_open_any() so we need to do
+ * gem_quiescent_gpu() explicitly, as it is the
+ * key component to trigger the oops
+ */
+ const int fd2 = drm_open_any();
+ igt_assert(fd2 >= 0);
+
+ /* This adds same batch on each ring */
+ gem_quiescent_gpu(fd2);
+
+ close(fd2);
+ return;
+ } else {
+ igt_assert(pid > 0);
+ sleep(1);
+
+ /* Kill the child to reduce refcounts on
+ batch_objs */
+ kill(pid, SIGKILL);
+ }
+
+ gem_close(fd, h);
+ close(fd);
+
+ /* Then we just wait on hang to happen */
+ fd = drm_open_any();
+ igt_assert(fd >= 0);
+
+ h = exec_valid(fd, 0);
+ igt_assert(h >= 0);
+
+ gem_sync(fd, h);
+ gem_close(fd, h);
+ close(fd);
+}
+
static void drop_root(void)
{
igt_assert(getuid() == 0);
@@ -837,6 +896,9 @@ igt_main
igt_subtest("close-pending")
test_close_pending();
+ igt_subtest("close-pending-fork")
+ test_close_pending_fork();
+
igt_subtest("params")
test_params();
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] tests/gem_reset_stats: add close-pending-fork
2013-12-04 14:39 ` Mika Kuoppala
@ 2013-12-04 15:48 ` Daniel Vetter
0 siblings, 0 replies; 12+ messages in thread
From: Daniel Vetter @ 2013-12-04 15:48 UTC (permalink / raw)
To: Mika Kuoppala; +Cc: intel-gfx
On Wed, Dec 04, 2013 at 04:39:09PM +0200, Mika Kuoppala wrote:
> From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> This triggers use after free oops on request->batch_obj when
> going through the rings and setting reset status on requests,
> after a gpu hang.
>
> v2: Streamlined the test and added comments (Daniel)
>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
> tests/gem_reset_stats.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)
>
> diff --git a/tests/gem_reset_stats.c b/tests/gem_reset_stats.c
> index 6c22bce..095b14b 100644
> --- a/tests/gem_reset_stats.c
> +++ b/tests/gem_reset_stats.c
> @@ -25,6 +25,7 @@
> *
> */
>
> +#define _GNU_SOURCE
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> @@ -36,6 +37,7 @@
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <time.h>
> +#include <signal.h>
>
> #include "i915_drm.h"
> #include "intel_bufmgr.h"
> @@ -637,6 +639,63 @@ static void test_close_pending(void)
> close(fd);
> }
>
> +static void test_close_pending_fork(void)
> +{
> + int pid;
> + int fd, h;
> +
> + fd = drm_open_any();
> + igt_assert(fd >= 0);
> +
> + assert_reset_status(fd, 0, RS_NO_ERROR);
> +
> + h = inject_hang(fd, 0);
> + igt_assert(h >= 0);
> +
> + sleep(1);
> +
> + /* Avoid helpers as we need to kill the child
> + * without any extra signal handling on behalf of
> + * lib/drmtest.c
> + */
> + pid = fork();
> + if (pid == 0) {
> + /* Not first drm_open_any() so we need to do
> + * gem_quiescent_gpu() explicitly, as it is the
> + * key component to trigger the oops
> + */
I've added a small comment here explaining what exactly is the magic
ingredient in quiescent_gpu and applied and pushed the patch.
But on second thoughs it's a bit fragile to depend upon the test library
behaviour in such a way. I think it's better to copy-paste the code in
quiescent_gpu to this file to make sure we never accidentally change it
and end up breaking this test.
When doing that we could also try to run the empty batch on all rings in
reverse order, just in case someone reorders a list somewhere. That should
make the testcase more robust at catching issues.
-Daniel
> + const int fd2 = drm_open_any();
> + igt_assert(fd2 >= 0);
> +
> + /* This adds same batch on each ring */
> + gem_quiescent_gpu(fd2);
> +
> + close(fd2);
> + return;
> + } else {
> + igt_assert(pid > 0);
> + sleep(1);
> +
> + /* Kill the child to reduce refcounts on
> + batch_objs */
> + kill(pid, SIGKILL);
> + }
> +
> + gem_close(fd, h);
> + close(fd);
> +
> + /* Then we just wait on hang to happen */
> + fd = drm_open_any();
> + igt_assert(fd >= 0);
> +
> + h = exec_valid(fd, 0);
> + igt_assert(h >= 0);
> +
> + gem_sync(fd, h);
> + gem_close(fd, h);
> + close(fd);
> +}
> +
> static void drop_root(void)
> {
> igt_assert(getuid() == 0);
> @@ -837,6 +896,9 @@ igt_main
> igt_subtest("close-pending")
> test_close_pending();
>
> + igt_subtest("close-pending-fork")
> + test_close_pending_fork();
> +
> igt_subtest("params")
> test_params();
> }
> --
> 1.7.9.5
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2013-12-04 15:47 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-02 14:47 [PATCH] tests/gem_reset_stats: add close-pending-fork Mika Kuoppala
2013-12-02 15:03 ` Chris Wilson
2013-12-02 16:32 ` Mika Kuoppala
2013-12-03 17:03 ` Daniel Vetter
2013-12-04 14:39 ` Mika Kuoppala
2013-12-04 15:48 ` Daniel Vetter
2013-12-02 15:31 ` [RFC] drm/i915: reference count batch object on requests Mika Kuoppala
2013-12-03 17:10 ` Daniel Vetter
2013-12-04 11:24 ` Chris Wilson
2013-12-04 12:08 ` Daniel Vetter
2013-12-04 12:11 ` Chris Wilson
2013-12-04 13:28 ` Mika Kuoppala
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox