All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Vetter <daniel@ffwll.ch>
To: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: "John Ogness" <john.ogness@linutronix.de>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Michel Dänzer" <michel@daenzer.net>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	"Sam Ravnborg" <sam@ravnborg.org>,
	"Noralf Trønnes" <noralf@tronnes.org>,
	daniel.vetter@ffwll.ch, "Thomas Gleixner" <tglx@linutronix.de>,
	"Harry Wentland" <harry.wentland@amd.com>,
	"Nicholas Kazlauskas" <nicholas.kazlauskas@amd.com>
Subject: Re: [PATCH v2 0/3] drm: Add panic handling
Date: Mon, 25 Mar 2019 09:42:19 +0100	[thread overview]
Message-ID: <20190325084219.GD2665@phenom.ffwll.local> (raw)
In-Reply-To: <20190317230613.GA9080@darwi-home-pc>

On Mon, Mar 18, 2019 at 12:06:13AM +0100, Ahmed S. Darwish wrote:
> 
> => Now that the dust has settled, here's a summary of this huge
>    50-email thread (thanks Daniel, Noralf, John, everyone!).
> 
> => Parts of this document are a direct rewording of Daniel's replies,
>    so I took the liberty of adding a Co-developed-by tag here..
> 
> => This is only a summary, and _not_ an official patch submission.
>    It's now Show-me-the-code time ;-)
> 
> Subject: [PATCH] Documentation: gpu: Add initial DRM panic design
> 
> Co-developed-by: Daniel Vetter <daniel@ffwll.ch>
> Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
> ---
>  Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++
>  1 file changed, 124 insertions(+)
>  create mode 100644 Documentation/gpu/drm-panic-design.rst
> 
> diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst
> new file mode 100644
> index 000000000000..ba306193f347
> --- /dev/null
> +++ b/Documentation/gpu/drm-panic-design.rst
> @@ -0,0 +1,124 @@
> +
> +========================
> +A DRM-based panic viewer
> +========================
> +
> +The Linux Kernel currently contains the necessary plumbing for viewing
> +a kernel panic log using DRM-based kmsg dumpers, even if the system is
> +currently running a graphical session (e.g. wayland).
> +
> +.. _drm_panic_design:
> +
> +Implementation design
> +=====================
> +
> +Code pathes running in a panic context face several constraints:
> +
> +1. Code must be fully synchronous: interrupts are disabled
> +2. Dynamic memory allocations are not allowed
> +3. Cannot acquire any mutexes: atomic context, due to 1.
> +4. Cannot acquire any spin locks: potential spinning-forever risk

Maybe rephrase as:

3. No sleeping (there's other ways to sleep than just memory allocations
and acquiring a mutex)
4. No unconditional locking at all (there's more than spinlocks/mutexes).
E.g. one of the most important locks is drm_modest_lock, which is a
ww_mutex.

> +
> +For the *DRM* panic code, the extra conditions below apply:
> +
> +5. Code must only trylock relevant DRM subsystem locks
> +6. If any trylock operation fails, the code *must not* go through
> +7. All rendering is done on the GPU's "current display buffer" used
> +   at the time of panic(): no HW programming is done at all.

Maybe let's clarify this a bit, since from the discussions it sounded like
at least amdgpu needs to touch a few bits (disable tiling, the indirect
vram write registers to access vram outside of the bar): "Only the least amount
of HW programming (preferrably none) is done, exceptions would be
disabling tiled/compressed scanout."

> +8. The code must be non-intrusive, and *must not* affect any other
> +   panic handling mechanism (netconsole, ramoops, kexec, etc.)

Hm, I'm not entirely clear on what you mean here. Maybe some more examples
of what would be a bad idea?

> +
> +Rationale follows.
> +
> +
> +Spin locks
> +----------
> +
> +Acquiring a spin lock in a panic() context is potentially lethal:
> +the lock might've been already acquired, _permanently_, by another
> +core that is now fully shut down through an IPI from the panic()-ing
> +core.
> +
> +Moreover, at least on x86, the first target architecture for this
> +work, the panic()-ing core wait by default for *a full second* until
> +all other cores finish their non-preemptible regions and terminate.
> +If that did not work out, it even tries more aggressively with NMIs.
> +
> +So if the other non panic()-ing cores was holding a DRM-related lock
> +through spin_lock_irqsave() for more than one second, then it's a
> +bug in the DRM layer code. Thus, the DRM panic viewer cannot do
> +anything and should just exit.  [A]
> +
> +What if the non panic()-ing core was holding a DRM lock through
> +barebone spin_lock()? Interrupts are enabled there, so the IPI will be
> +handled, and thus that core will effectively die while the lock is
> +*forever held*.  [B]
> +
> +
> +Trylocking
> +----------
> +
> +The DRM panic code always *tries* to acquire the *minimum relevant
> +set* of DRM related locks, through the basic :c:func:`spin_trylock()`
> +mechanism.
> +
> +From case [A] and case [B] above, if the trylock operation fails,
> +there's no point in retrying it multiple times: the relevant locks
> +are in a broken and unrecoverable state anyway.
> +
> +Moreover, The panic code cannot also just ignore the DRM locks and
> +force its way through: a broken non-preemptible DRM region implies
> +either invalid SW state (e.g. broken linked list pointers), or a GPU
> +in an unknown HW state.
> +
> +A GPU in an unknown HW state is quite dangerous: it has access to the
> +full system memory, and if poked incorrectly, there's a really good
> +chance it can kill the entire machine.
> +
> +
> +GPU hardware access
> +-------------------
> +
> +In the current state, a full GPU reset, modesetting, or even disabling
> +GPU planes, is not doable under a panic() context: it implies going
> +through a potentially huge set of DRM call-chains that cannot be
> +sanely verified against the :ref:`drm_panic_design` requirements
> +(e.g. memory allocations, spinlocks, etc.).
> +
> +The current approach is simple: run the minimal amount of code
> +necessary to draw pixels onto the current scanout buffers. Instead
> +of disabling GPU planes, the biggest visible rectangle is just picked.
> +
> +*Usually* there should be a main window that is big enough to show the
> +oops.
> +
> +
> +CI testing
> +----------
> +
> +One of the things that killed earlier linux DRM panic handling efforts,
> +beside getting into deep DRM call-chains that cannot be verified, was
> +that it couldn't be tested except with real oopses.
> +
> +The first set of bug reports was whack-a-molde kind of bugs where the
> +oops displayed was caused by the DRM panic handler itself instead of
> +the real oops causing the panic.
> +
> +Thus, the :ref:`drm_panic_design` requirements was created. Moreover:
> +
> +  - Special hooks are added at the spin_lock() level to complain
> +    loudly if a spin lock operation was tried under the DRM panic
> +    context. This could be easily noticed/reported by CI testing.
> +
> +  - *Non-destructive* testing of the DRM panic code, emulating a
> +    real panic path context as much as possible (e.g. by disabling
> +    irqs and enabling the spin lock hooks earlier mentioned), is
> +    created. This is necessary for heaviling testing the DRM panic
> +    code through `CI machines <https://lwn.net/Articles/735468/>`_.
> +
> +
> +Supported drivers
> +=================
> +
> +* Intel i915-compatible cards

Excellent summary, thanks for typing this up.
-Daniel

> 
> 
> --
> darwi
> http://darwish.chasingpointers.com

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

      reply	other threads:[~2019-03-25  8:42 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-11 17:42 [PATCH v2 0/3] drm: Add panic handling Noralf Trønnes
2019-03-11 17:42 ` [PATCH v2 1/3] drm: Add support for panic message output Noralf Trønnes
2019-03-11 19:23   ` Daniel Vetter
2019-03-11 19:29     ` Daniel Vetter
2019-03-11 22:40       ` Noralf Trønnes
2019-03-12  9:53         ` Daniel Vetter
2019-03-12  9:59           ` Daniel Vetter
2019-03-11 22:33     ` Noralf Trønnes
2019-03-12 10:58       ` Daniel Vetter
2019-03-12 13:29         ` Noralf Trønnes
2019-03-13  3:53         ` Ahmed S. Darwish
2019-03-12 22:13       ` Ahmed S. Darwish
2019-03-13  7:49         ` John Ogness
2019-03-13  8:37           ` Daniel Vetter
2019-03-14  2:51             ` Ahmed S. Darwish
2019-03-14  9:32               ` Daniel Vetter
2019-03-14  9:43                 ` John Ogness
2019-03-14  9:52                   ` John Ogness
2019-03-15 10:56                     ` Daniel Vetter
2019-03-13  8:35         ` Daniel Vetter
2019-03-14  4:45           ` Ahmed S. Darwish
2019-03-14  9:35             ` Daniel Vetter
2019-03-13 10:24         ` Noralf Trønnes
2019-03-13  4:05       ` Ahmed S. Darwish
2019-03-11 19:55   ` Sam Ravnborg
2019-03-12 10:47   ` Michel Dänzer
2019-03-12 16:17     ` [Intel-gfx] " Ville Syrjälä
2019-03-12 17:15       ` Noralf Trønnes
2019-03-12 17:25         ` Ville Syrjälä
2019-03-12 17:37           ` Noralf Trønnes
2019-03-12 17:44             ` Noralf Trønnes
2019-03-12 18:02             ` [Intel-gfx] " Ville Syrjälä
2019-03-13  8:29               ` Christian König
2019-03-13  8:43               ` [Intel-gfx] " Daniel Vetter
2019-03-13  9:35         ` Michel Dänzer
2019-03-13 13:31           ` [Intel-gfx] " Ville Syrjälä
2019-03-13 13:37             ` Christian König
2019-03-13 15:38               ` Michel Dänzer
2019-03-13 15:54                 ` [Intel-gfx] " Christian König
2019-03-13 16:16                   ` Kazlauskas, Nicholas
2019-03-13 17:30                     ` Koenig, Christian
2019-03-13 17:33                     ` Michel Dänzer
2019-03-13 17:41                       ` Kazlauskas, Nicholas
2019-03-14  9:50                         ` Daniel Vetter
2019-03-14 12:44                           ` Kazlauskas, Nicholas
2019-03-15 10:58                             ` [Intel-gfx] " Daniel Vetter
2019-03-13 17:52                       ` Koenig, Christian
2019-03-14  9:40                   ` [Intel-gfx] " Daniel Vetter
2019-03-11 17:42 ` [PATCH v2 2/3] drm/cma-helper: Add support for panic screen Noralf Trønnes
2019-03-11 17:42 ` [PATCH v2 3/3] drm/vc4: Support " Noralf Trønnes
2019-03-11 18:53 ` [PATCH v2 0/3] drm: Add panic handling Daniel Vetter
2019-03-12  9:09 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2019-03-12  9:12 ` ✗ Fi.CI.SPARSE: " Patchwork
2019-03-12  9:50 ` ✓ Fi.CI.BAT: success " Patchwork
2019-03-12 10:12 ` ✗ Fi.CI.BAT: failure for drm: Add panic handling (rev2) Patchwork
2019-03-17 23:06 ` [PATCH v2 0/3] drm: Add panic handling Ahmed S. Darwish
2019-03-25  8:42   ` Daniel Vetter [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190325084219.GD2665@phenom.ffwll.local \
    --to=daniel@ffwll.ch \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=darwish.07@gmail.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=harry.wentland@amd.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=john.ogness@linutronix.de \
    --cc=michel@daenzer.net \
    --cc=nicholas.kazlauskas@amd.com \
    --cc=noralf@tronnes.org \
    --cc=sam@ravnborg.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.