From: Daniel Vetter <daniel@ffwll.ch>
To: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: intel-gfx@lists.freedesktop.org, David Woodhouse <dwmw2@infradead.org>
Subject: Re: [PATCH 6/9] drm/i915: driver based PASID handling
Date: Thu, 8 Oct 2015 10:12:08 +0200 [thread overview]
Message-ID: <20151008081208.GY3383@phenom.ffwll.local> (raw)
In-Reply-To: <561555BC.7070700@virtuousgeek.org>
On Wed, Oct 07, 2015 at 10:26:20AM -0700, Jesse Barnes wrote:
> On 10/07/2015 10:17 AM, David Woodhouse wrote:
> > On Wed, 2015-10-07 at 09:28 -0700, Jesse Barnes wrote:
> >> On 10/07/2015 09:14 AM, Daniel Vetter wrote:
> >>> On Wed, Oct 07, 2015 at 08:16:42AM -0700, Jesse Barnes wrote:
> >>>> On 10/07/2015 06:00 AM, David Woodhouse wrote:
> >>>>> On Fri, 2015-09-04 at 09:59 -0700, Jesse Barnes wrote:
> >>>>>> +
> >>>>>> + ret = handle_mm_fault(mm, vma, address,
> >>>>>> + desc.wr_req ? FAULT_FLAG_WRITE : 0);
> >>>>>> + if (ret & VM_FAULT_ERROR) {
> >>>>>> + gpu_mm_segv(tsk, address, SEGV_ACCERR); /* ? */
> >>>>>> + goto out_unlock;
> >>>>>> + }
> >>>>>> +
> >>>>>
> >>>>> Hm, do you need to force the SEGV there, in what ought to be generic
> >>>>> IOMMU code?
> >>>>>
> >>>>> Can you instead just let the fault handler return an appropriate
> >>>>> failure code to the IOMMU request queue and then deal with the
> >>>>> resulting error on the i915 device side?
> >>>>
> >>>> I'm not sure if we get enough info on the i915 side to handle it
> >>>> reasonably, we'll have to test that out.
> >>>
> >>> We do know precisely which context blew up, but without the TDR work we
> >>> can't yet just kill the offender selective without affecting the other
> >>> active gpu contexts.
> >>
> >> How? The notification from the IOMMU queue is asynchronous...
> >
> > The page request, and the response, include 'private data' which an
> > endpoint can use to carry that kind of information.
> >
> > In $7.5.1.1 of the VT-d specification it tells us:
> >
> > "Private Data: The Private Data field can be used by
> > Root-Complex integrated endpoints to uniquely identify
> > device-specific private information associated with an
> > individual page request.
> >
> > "For Intel ® Processor Graphics device, the Private Data field
> > specifies the identity of the GPU advanced-context (see
> > Section 3.10) sending the page request."
>
> Ah right so we could put our private context ID in there if the PASID
> doesn't end up being per-context. That would work fine (though as
> Daniel said we still need fancier reset to handle things more
> gracefully).
I'd hope we can be even more lazy: If we complete the page request with a
failure then hopefully the gpu read/write transaction never completes in
the EU/ff-unit which means it'll be stuck forever and our hangcheck will
get around to clean up the mess. That should work with 0 code changes (but
needs a testcase ofc).
Later on we can get fancy and try to immediate preempt the ctx and kill it
if it faults. But there's a bit of infrastructure missing for that, and I
think it'd be better to not stall svm on that.
> >>> But besides that I really don't see a reason why we need to kill the
> >>> process if the gpu faults. After all if a thread sigfaults then signal
> >>> goes to that thread and not some random one (or the one thread that forked
> >>> the thread that blew up). And we do have interfaces to tell userspace that
> >>> something bad happened with the gpu work it submitted.
> >
> > I certainly don't want the core IOMMU code killing things. I really
> > want to just complete the page request with an appropriate failure
> > code, and let the endpoint device deal with it from there.
>
> I think that will work, but I want to test and make sure. In the driver
> mode version I took advantage of the fact that I got an unambiguous page
> request failure from the IOMMU along with a unique PASID to send the
> signal. Getting it on the GPU side means looking at some of our
> existing error state bits, which is something I've been avoiding...
gem_reset_stats does this for your already, including test coverage and
all. What might be missing is getting reset events from a pollable fd,
since the current needs explicit polling. It works that way since that's
all arb_robustness wants.
And with an fd we can always use the generic SIG_IO stuff for userspace
that wants a signal, but by default I don't think we should use signals at
all for gpu page faults: The current SIG_SEGV can't be used (since it's
clearly for thread-local faults only), same for SIG_BUS, SIG_IO is for fd
polling and there's nothing else available.
But even the pollable fd is probably best done in a follow-up series, if
we even have a need for it.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2015-10-08 8:09 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-04 16:58 [RFC] Page table sharing and bufferless execbuf Jesse Barnes
2015-09-04 16:58 ` [PATCH 1/9] mm: move mmu_find_ops to mmu_notifier.c Jesse Barnes
2015-09-04 16:58 ` [PATCH 2/9] signal: export force_sig_info Jesse Barnes
2015-09-04 16:58 ` [PATCH 3/9] android/sync: add sync_fence_create_dma Jesse Barnes
2015-09-04 16:58 ` [PATCH 4/9] android/sync: hack: enable fence signaling in Android Native Sync implementation Jesse Barnes
2015-09-04 16:58 ` [PATCH 5/9] drm/i915: add create_context2 ioctl Jesse Barnes
2015-09-04 16:59 ` [PATCH 6/9] drm/i915: driver based PASID handling Jesse Barnes
2015-10-07 13:00 ` David Woodhouse
2015-10-07 15:16 ` Jesse Barnes
2015-10-07 16:14 ` Daniel Vetter
2015-10-07 16:28 ` Jesse Barnes
2015-10-07 17:17 ` David Woodhouse
2015-10-07 17:26 ` Jesse Barnes
2015-10-08 8:12 ` Daniel Vetter [this message]
2015-10-08 10:28 ` Tomas Elf
2015-10-08 11:29 ` Tomas Elf
2015-10-08 22:46 ` David Woodhouse
2015-10-09 7:28 ` Daniel Vetter
2015-10-09 7:52 ` Daniel Vetter
2015-10-09 7:56 ` David Woodhouse
2015-10-09 8:47 ` Daniel Vetter
2015-10-09 9:07 ` David Woodhouse
2015-10-09 16:20 ` Jesse Barnes
2015-10-08 15:57 ` Chris Wilson
2015-10-09 7:24 ` David Woodhouse
2015-09-04 16:59 ` [PATCH 7/9] drm/i915: add fences to the request struct Jesse Barnes
2015-10-09 13:29 ` David Woodhouse
2015-10-09 16:11 ` Jesse Barnes
2015-09-04 16:59 ` [PATCH 8/9] drm/i915: Android sync points for i915 v4 (obsolete) Jesse Barnes
2015-09-04 16:59 ` [PATCH 9/9] drm/i915: add bufferless execbuf ioctl Jesse Barnes
2015-09-04 17:37 ` Chris Wilson
2015-09-04 19:09 ` Jesse Barnes
2015-10-08 10:34 ` David Woodhouse
2015-09-04 17:23 ` [RFC] Page table sharing and bufferless execbuf Chris Wilson
2015-09-04 19:08 ` Jesse Barnes
2015-09-26 14:55 ` David Woodhouse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151008081208.GY3383@phenom.ffwll.local \
--to=daniel@ffwll.ch \
--cc=dwmw2@infradead.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=jbarnes@virtuousgeek.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.