linux-tegra.vger.kernel.org archive mirror
* [RFC PATCH v2 0/5] More explicit pushbuf error handling
@ 2015-08-31 11:38 Konsta Hölttä
From: Konsta Hölttä @ 2015-08-31 11:38 UTC
  To: bskeggs-H+wXaHxf7aLQT0dZR+AlfA,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA

Hi there,

Resending these now that they've had some more polish and testing, and I heard
that Ben's vacation is over :-)

These patches work as a starting point for more explicit error mechanisms and
better robustness. At the moment, when a job hangs or faults, nouveau doesn't
seem to know how to handle the situation, and the result is often a hang. Some
of these situations would require completely resetting the gpu, a complex path
for recovering only the broken channel, or both.

To start, I worked on support for letting userspace know what exactly happened.
Proper recovery would come later. The "error notifier" in the first patch is a
simple shared buffer between kernel and userspace. Its error codes match
nvgpu's. Alternatively, the status could be queried with an ioctl, but that
would be considerably more heavyweight. I'd like to know if the event mechanism
is meant for these kinds of events at all (engines notify errors upwards to the
drm layer). Another alternative would probably be to register the same buffer
to all necessary engines separately in nvif method calls? Or register it to
just one (e.g., fifo) and get that engine when errors happen in others (e.g.,
gr)? And drm handles probably wouldn't fit there? Please comment on this; I
wrote this before understanding the mthd mechanism.
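
To illustrate the "error notifier" idea described above, here is a minimal
sketch of a buffer shared between a kernel-side writer and a userspace reader.
It is not the code from the patch: the struct layout, the enum values, and all
sketch_* names are hypothetical, and the real error codes would follow nvgpu's.
A real shared mapping would also need proper memory barriers, which are omitted
here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes; the actual values would match nvgpu's. */
enum sketch_error {
	SKETCH_ERROR_NONE = 0,
	SKETCH_ERROR_TIMEOUT = 1,
	SKETCH_ERROR_FAULT = 2,
};

/* Hypothetical layout of the buffer shared by kernel and userspace. */
struct sketch_error_notifier {
	uint64_t timestamp_ns; /* when the error was recorded */
	uint32_t error;        /* one of enum sketch_error */
	uint32_t status;       /* nonzero once an error has been written */
};

/*
 * "Kernel" side: record an error for the channel. Only the first error
 * is kept, so that a later follow-up fault does not hide the root cause.
 */
static void sketch_notify_error(struct sketch_error_notifier *n,
				uint32_t error, uint64_t now_ns)
{
	assert(n != NULL);
	if (n->status)
		return;
	n->timestamp_ns = now_ns;
	n->error = error;
	n->status = 1; /* written last; a real version needs a barrier here */
}

/* "Userspace" side: cheap poll of the shared buffer, no ioctl needed. */
static uint32_t sketch_poll_error(const struct sketch_error_notifier *n)
{
	assert(n != NULL);
	return n->status ? n->error : SKETCH_ERROR_NONE;
}
```

The point of the design is that userspace can check the status with a plain
memory read, which is what makes it lighter than an ioctl query.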

Additionally, priority and timeout management for separate channels in flight
on the gpu is added in two patches. Neither is exactly what the name says, but
the effect is the same, and this is what nvgpu does currently. Those two
patches call the fifo channel object's methods directly from userspace, so a
hack is added in the nvif path to accept that. The objects are NEW'd from
kernel space, so it appears that calling them from userspace isn't allowed. How
should this be managed in a clean way?

Also, since nouveau often hangs on errors, userspace hangs too (waiting on
a fence). The final patch attempts to fix this in a couple of specific error
paths by forcibly marking all fences as finished. I'd like to hear how that
should be handled properly - consider the patch just a proof-of-concept and a
sample of what would be necessary.
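
The fence forcing described above can be sketched roughly as follows. This is
an illustration under stated assumptions, not the patch itself: it assumes
fences are sequence numbers on a per-channel context, and the sketch_* names
and the "dead" flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-channel fence state. */
struct sketch_fence_ctx {
	uint32_t emitted;   /* last sequence number handed out */
	uint32_t completed; /* last sequence number the gpu finished */
	bool dead;          /* set when the channel hit an error */
};

/* Hand out a new fence sequence number for a submitted job. */
static uint32_t sketch_fence_emit(struct sketch_fence_ctx *c)
{
	assert(c != NULL);
	return ++c->emitted;
}

/* A fence is done once the completed counter has reached it. */
static bool sketch_fence_done(const struct sketch_fence_ctx *c, uint32_t seq)
{
	assert(c != NULL);
	/* Signed difference keeps the comparison correct across wraparound. */
	return (int32_t)(c->completed - seq) >= 0;
}

/*
 * Error path: mark every emitted fence as finished so that waiters wake
 * up, and flag the channel so they can tell the work did not really run.
 */
static void sketch_fence_force_all(struct sketch_fence_ctx *c)
{
	assert(c != NULL);
	c->dead = true;
	c->completed = c->emitted;
}
```

A waiter that sees its fence done would then check the dead flag (or the error
notifier) to distinguish a forced completion from a real one.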

I don't expect the patches to be accepted as-is - as a newbie, I'd appreciate
any high-level comments on whether I've understood things correctly, especially
the event and nvif/method mechanisms (I use the latter from userspace with a
hack, built from the perfmon branch seen here earlier and ported into nvidia's
internal libdrm-equivalent). The fence forcing is necessary with the error
notifiers (at least with our userspace, which waits a very long time or
indefinitely on fences). I'm working specifically on Tegra and don't know much
about the desktop's userspace details, so I may be biased in some areas.

I'd be happy to write sample tests for the new methods in e.g. libdrm once the
kernel patches are in good shape, if that's required for accepting new
features. I tested these as a proof-of-concept on Jetson TK1, and the code is
adapted from the latest nvgpu.

The patches can also be found at http://github.com/sooda/nouveau and are based
on a version of gnurou/staging.

Thanks!
Konsta (sooda in IRC)

Konsta Hölttä (5):
  notify channel errors to userspace
  don't verify route == owner in nvkm ioctl
  gk104: channel priority/timeslice support
  gk104: channel timeout detection
  HACK force fences updated on error

 drm/nouveau/include/nvif/class.h       |  20 ++++
 drm/nouveau/include/nvif/event.h       |  12 +++
 drm/nouveau/include/nvkm/engine/fifo.h |   5 +-
 drm/nouveau/nouveau_chan.c             |  95 +++++++++++++++++++
 drm/nouveau/nouveau_chan.h             |  10 ++
 drm/nouveau/nouveau_drm.c              |   1 +
 drm/nouveau/nouveau_fence.c            |  13 ++-
 drm/nouveau/nouveau_gem.c              |  29 ++++++
 drm/nouveau/nouveau_gem.h              |   2 +
 drm/nouveau/nvkm/core/ioctl.c          |   9 +-
 drm/nouveau/nvkm/engine/fifo/base.c    |  56 ++++++++++-
 drm/nouveau/nvkm/engine/fifo/gf100.c   |   2 +-
 drm/nouveau/nvkm/engine/fifo/gk104.c   | 166 ++++++++++++++++++++++++++++++---
 drm/nouveau/nvkm/engine/fifo/nv04.c    |   2 +-
 drm/nouveau/nvkm/engine/gr/gf100.c     |   5 +
 drm/nouveau/uapi/drm/nouveau_drm.h     |  13 +++
 16 files changed, 415 insertions(+), 25 deletions(-)

-- 
2.1.4


end of thread, other threads:[~2015-09-03  9:01 UTC | newest]

Thread overview: 10+ messages
2015-08-31 11:38 [RFC PATCH v2 0/5] More explicit pushbuf error handling Konsta Hölttä
     [not found] ` <1441021115-28537-1-git-send-email-kholtta-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2015-08-31 11:38   ` [RFC PATCH v2 1/5] notify channel errors to userspace Konsta Hölttä
     [not found]     ` <1441021115-28537-2-git-send-email-kholtta-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2015-09-03  9:01       ` [Nouveau] " Alexandre Courbot
2015-08-31 11:38   ` [RFC PATCH v2 2/5] don't verify route == owner in nvkm ioctl Konsta Hölttä
2015-08-31 11:38   ` [RFC PATCH v2 3/5] gk104: channel priority/timeslice support Konsta Hölttä
2015-08-31 11:38   ` [RFC PATCH v2 4/5] gk104: channel timeout detection Konsta Hölttä
2015-08-31 11:38   ` [RFC PATCH v2 5/5] HACK force fences updated on error Konsta Hölttä
2015-09-01 13:26   ` [Nouveau] [RFC PATCH v2 0/5] More explicit pushbuf error handling Ben Skeggs
     [not found]     ` <CACAvsv7Bfz-1Ck2=6DbA_N9CggEcPbXNOjM+6JHM_hxC9LHN-A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-02 10:01       ` Konsta Hölttä
     [not found]         ` <55E6C8E5.6050401-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2015-09-02 10:13           ` Konsta Hölttä
