Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains

From: Andi Shyti <andi.shyti@linux.intel.com>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Andi Shyti <andi.shyti@kernel.org>,
	intel-gfx@lists.freedesktop.org,
	Matthew Auld <matthew.auld@intel.com>,
	dri-devel@lists.freedesktop.org,
	Maciej Patelczyk <maciej.patelczyk@intel.com>,
	stable@vger.kernel.org,
	Chris Wilson <chris.p.wilson@linux.intel.com>,
	"Das, Nirmoy" <nirmoy.das@intel.com>
Subject: Re: [Intel-gfx] [PATCH v4 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
Date: Wed, 12 Apr 2023 12:56:26 +0200
Message-ID: <ZDaOWhKiG5jD7ftp@ashyti-mobl2.lan>
In-Reply-To: <ZDVwMawvlOLZ2VZt@intel.com>

Hi Rodrigo,

> > > Currently, when we perform operations such as clearing or copying
> > > large blocks of memory, we generate multiple requests that are
> > > executed in a chain.
> > > 
> > > However, if one of these requests fails, we may not realize it
> > > unless it happens to be the last request in the chain. This is
> > > because errors are not properly propagated.
> > > 
> > > For this we need to keep propagating the chain of fence
> > > notification in order to always reach the final fence associated
> > > to the final request.
> > > 
> > > To address this issue, we need to ensure that the chain of fence
> > > notifications is always propagated so that we can reach the final
> > > fence associated with the last request. By doing so, we will be
> > > able to detect any memory operation failures and determine
> > > whether the memory is still invalid.
> > > 
> > > On copy and clear migration signal fences upon completion.
> > > 
> > > On copy and clear migration, signal fences upon request
> > > completion to ensure that we have a reliable perpetuation of the
> > > operation outcome.
> > > 
> > > Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
> > > Reported-by: Matthew Auld <matthew.auld@intel.com>
> > > Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > > Cc: stable@vger.kernel.org
> > > Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> > With Matt's comment regarding missing lock in intel_context_migrate_clear
> > addressed, this is:
> > 
> > Acked-by: Nirmoy Das <nirmoy.das@intel.com>
> 
> Nack!
> 
> Please get some ack from Joonas or Tvrtko before merging this series.

There is no architectural change... of course, Joonas and Tvrtko
are more than welcome (and actually invited) to look into this
patch.

And, btw, there are still some discussions ongoing on this whole
series, so I'm not going to merge it any time soon. I'm just
happy to revive the discussion.

> It is a big series targeting stable o.O where the revisions in the cover
> letter are not helping me to be confident that this is the right approach
> instead of simply reverting the original offending commit:
> 
> cf586021642d ("drm/i915/gt: Pipelined page migration")

Why should we remove the migration completely? What about the
copy?

> It looks to me that we are adding magic on top of magic to workaround
> the deadlocks, but then adding more waits inside locks... And this with
> the hang checks vs heartbeats, is this really an issue on current upstream
> code? or was only on DII?

There is no real magic happening here. It's just that the error
was not reaching the end of the operation, whereas this patch
passes it along.
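
To illustrate the idea only (this is a userspace toy, not the i915
code; toy_request, toy_request_complete and toy_run_chain are
made-up names), a chain where each request inherits the error of
the previous one reports a mid-chain failure on the final fence,
while a chain without propagation reports only the last result:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a request and its completion fence. */
struct toy_request {
	int error;	/* 0 on success, negative errno on failure */
};

/*
 * Complete rq with its own result. If it succeeded but the previous
 * request in the chain failed, inherit the previous error.
 */
static void toy_request_complete(struct toy_request *rq,
				 const struct toy_request *prev,
				 int result)
{
	rq->error = result;
	if (prev && prev->error && !rq->error)
		rq->error = prev->error;
}

/*
 * Run a chain of n requests and return the error seen on the final
 * one, which is the only fence the caller waits on.
 */
static int toy_run_chain(const int *results, size_t n, int propagate)
{
	struct toy_request prev = { 0 }, rq = { 0 };
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		/* Without propagation each request ignores its predecessor. */
		toy_request_complete(&rq, (propagate && i) ? &prev : NULL,
				     results[i]);
		prev = rq;
	}

	return rq.error;
}
```

With results { 0, -EIO, 0 }, the non-propagating chain ends with no
error on the last request, and the propagating one ends with -EIO.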

> Where was the bug report to start with?

Matt reported this; I will give you the necessary links to it
offline.

Thanks for looking into this,
Andi
