Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Andi Shyti <andi.shyti@linux.intel.com>,
	intel-gfx@lists.freedesktop.org,
	 dri-devel@lists.freedesktop.org, stable@vger.kernel.org
Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>,
	Matthew Auld <matthew.auld@intel.com>,
	Andrzej Hajda <andrzej.hajda@intel.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Nirmoy Das <nirmoy.das@intel.com>,
	Chris Wilson <chris.p.wilson@linux.intel.com>,
	Andi Shyti <andi.shyti@kernel.org>
Subject: Re: [Intel-gfx] [PATCH v5 5/5] drm/i915/gt: Make sure that errors are propagated through request chains
Date: Thu, 13 Apr 2023 13:43:45 +0100	[thread overview]
Message-ID: <5b7f82db-b9dd-e9c9-496c-72995469d699@linux.intel.com> (raw)
In-Reply-To: <ca796c78-67cf-c803-b3bc-7d6eaa542b32@linux.intel.com>


On 13/04/2023 12:56, Tvrtko Ursulin wrote:
> 
> On 12/04/2023 12:33, Andi Shyti wrote:
>> Currently, when we perform operations such as clearing or copying
>> large blocks of memory, we generate multiple requests that are
>> executed in a chain.
>>
>> However, if one of these requests fails, we may not realize it
>> unless it happens to be the last request in the chain. This is
>> because errors are not properly propagated.
>>
>> For this we need to keep propagating the chain of fence
>> notification in order to always reach the final fence associated
>> to the final request.
>>
>> To address this issue, we need to ensure that the chain of fence
>> notifications is always propagated so that we can reach the final
>> fence associated with the last request. By doing so, we will be
>> able to detect any memory operation  failures and determine
>> whether the memory is still invalid.
> 
> Above two paragraphs seems to have redundancy in the message they convey.
> 
>> On copy and clear migration signal fences upon completion.
>>
>> On copy and clear migration, signal fences upon request
>> completion to ensure that we have a reliable perpetuation of the
>> operation outcome.
> 
> These two too. So I think commit message can be a bit polished.
> 
>> Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
>> Reported-by: Matthew Auld <matthew.auld@intel.com>
>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
>> Cc: stable@vger.kernel.org
>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>> Acked-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++++------
>>   1 file changed, 39 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c 
>> b/drivers/gpu/drm/i915/gt/intel_migrate.c
>> index 3f638f1987968..668c95af8cbcf 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
>> @@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context 
>> *ce,
>>               dst_offset = 2 * CHUNK_SZ;
>>       }
>> +    /*
>> +     * While building the chain of requests, we need to ensure
>> +     * that no one can sneak into the timeline unnoticed.
>> +     */
>> +    mutex_lock(&ce->timeline->mutex);
>> +
>>       do {
>>           int len;
>> -        rq = i915_request_create(ce);
>> +        rq = i915_request_create_locked(ce);
>>           if (IS_ERR(rq)) {
>>               err = PTR_ERR(rq);
>> -            goto out_ce;
>> +            break;
>>           }
>>           if (deps) {
>> @@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context 
>> *ce,
>>           /* Arbitration is re-enabled between requests. */
>>   out_rq:
>> -        if (*out)
>> +        i915_sw_fence_await(&rq->submit);
>> +        i915_request_get(rq);
>> +        i915_request_add_locked(rq);
>> +        if (*out) {
>> +            i915_sw_fence_complete(&(*out)->submit);
>>               i915_request_put(*out);
> 
> Could you help me understand this please. I have a few questions - 
> first, what are the actual mechanics of fence error transfer here? I see 
> the submit fence is being blocked until the next request is submitted - 
> effectively previous request is only allowed to get on the hardware 
> after the next one has been queued up. But I don't immediately see what 
> that does in practice.
> 
> Second question relates to the need to hold the timeline mutex 
> throughout. Presumably this is so two copy or migrate operations on the 
> same context do not interleave, which can otherwise happen?
> 
> Would the error propagation be doable without the lock held by chaining 
> on the previous request _completion_ fence? If so I am sure that would 
> have a performance impact, because chunk by chunk would need a GPU<->CPU 
> round trip to schedule. How much of an impact I don't know. Maybe 
> enlarging CHUNK_SZ to compensate is an option?
> 
> Or if the perf hit would be bearable for stable backports only (much 
> smaller patch) and then for tip we can do this full speed solution.
> 
> But yes, I would first want to understand the actual error propagation 
> mechanism because sadly my working knowledge is a bit rusty.

Another option - maybe - is this related to revert of fence error 
propagation? If it is and having that would avoid the need for this 
invasive fix, maybe we unrevert 3761baae908a7b5012be08d70fa553cc2eb82305 
with edits to limit to special contexts? If doable..

Regards,

Tvrtko

> 
>> -        *out = i915_request_get(rq);
>> -        i915_request_add(rq);
>> +        }
>> +        *out = rq;
>>           if (err)
>>               break;
>> @@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
>>           cond_resched();
>>       } while (1);
>> -out_ce:
>> +    mutex_unlock(&ce->timeline->mutex);
>> +
>> +    if (*out)
>> +        i915_sw_fence_complete(&(*out)->submit);
>>       return err;
>>   }
>> @@ -999,13 +1012,19 @@ intel_context_migrate_clear(struct 
>> intel_context *ce,
>>       if (HAS_64K_PAGES(i915) && is_lmem)
>>           offset = CHUNK_SZ;
>> +    /*
>> +     * While building the chain of requests, we need to ensure
>> +     * that no one can sneak into the timeline unnoticed.
>> +     */
>> +    mutex_lock(&ce->timeline->mutex);
>> +
>>       do {
>>           int len;
>> -        rq = i915_request_create(ce);
>> +        rq = i915_request_create_locked(ce);
>>           if (IS_ERR(rq)) {
>>               err = PTR_ERR(rq);
>> -            goto out_ce;
>> +            break;
>>           }
>>           if (deps) {
>> @@ -1056,17 +1075,25 @@ intel_context_migrate_clear(struct 
>> intel_context *ce,
>>           /* Arbitration is re-enabled between requests. */
>>   out_rq:
>> -        if (*out)
>> +        i915_sw_fence_await(&rq->submit);
>> +        i915_request_get(rq);
>> +        i915_request_add_locked(rq);
>> +        if (*out) {
>> +            i915_sw_fence_complete(&(*out)->submit);
>>               i915_request_put(*out);
>> -        *out = i915_request_get(rq);
>> -        i915_request_add(rq);
>> +        }
>> +        *out = rq;
> 
> Btw if all else fails perhaps these two blocks can be consolidated by 
> something like __chain_requests(rq, out) and all these operations in it. 
> Not sure how much would that save in the grand total.
> 
> Regards,
> 
> Tvrtko
> 
>> +
>>           if (err || !it.sg || !sg_dma_len(it.sg))
>>               break;
>>           cond_resched();
>>       } while (1);
>> -out_ce:
>> +    mutex_unlock(&ce->timeline->mutex);
>> +
>> +    if (*out)
>> +        i915_sw_fence_complete(&(*out)->submit);
>>       return err;
>>   }

  reply	other threads:[~2023-04-13 12:43 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-12 11:33 [Intel-gfx] [PATCH v5 0/5] Fix error propagation amongst request Andi Shyti
2023-04-12 11:33 ` [Intel-gfx] [PATCH v5 1/5] drm/i915/gt: Add intel_context_timeline_is_locked helper Andi Shyti
2023-04-12 12:12   ` Andrzej Hajda
2023-04-12 11:33 ` [Intel-gfx] [PATCH v5 2/5] drm/i915: Create the locked version of the request create Andi Shyti
2023-04-12 13:04   ` Andrzej Hajda
2023-04-13  8:53     ` Andi Shyti
2023-04-12 11:33 ` [Intel-gfx] [PATCH v5 3/5] drm/i915: Create the locked version of the request add Andi Shyti
2023-04-12 13:06   ` Andrzej Hajda
2023-04-13  8:57     ` Andi Shyti
2023-04-12 11:33 ` [Intel-gfx] [PATCH v5 4/5] drm/i915: Throttle for ringspace prior to taking the timeline mutex Andi Shyti
2023-04-12 11:33 ` [Intel-gfx] [PATCH v5 5/5] drm/i915/gt: Make sure that errors are propagated through request chains Andi Shyti
2023-04-13 11:56   ` Tvrtko Ursulin
2023-04-13 12:43     ` Tvrtko Ursulin [this message]
2023-05-04  9:45       ` Andi Shyti
2023-05-04  9:37     ` Andi Shyti
2023-04-12 17:28 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Fix error propagation amongst request (rev3) Patchwork
2023-04-12 17:28 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-04-12 17:41 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-04-13  3:47 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5b7f82db-b9dd-e9c9-496c-72995469d699@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=andi.shyti@kernel.org \
    --cc=andi.shyti@linux.intel.com \
    --cc=andrzej.hajda@intel.com \
    --cc=chris.p.wilson@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=maciej.patelczyk@intel.com \
    --cc=matthew.auld@intel.com \
    --cc=nirmoy.das@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox