AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: "Liu, Shaoyun" <Shaoyun.Liu@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Li, Yunxiang (Teddy)" <Yunxiang.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Xiao, Hua" <Hua.Xiao@amd.com>
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
Date: Mon, 3 Jun 2024 12:58:43 +0200	[thread overview]
Message-ID: <380c4bd1-ab50-4f42-bb50-69dc0fbd28fe@amd.com> (raw)
In-Reply-To: <CH0PR12MB53721A7C3403982035E48639F4FC2@CH0PR12MB5372.namprd12.prod.outlook.com>

Hi Shaoyun,

yes my thinking goes into the same direction. The basic problem here is 
that we are trying to stuff two different information into the same 
variable.

The first information is if the commands haven been read by the MES from 
the ring buffer. This information is necessary for the normal ring 
buffer and reset handling, e.g. prevents ring buffer overflow, ordering 
of command, lockups during reset etc...

The second information is if a certain operation was successfully or 
not. For example this is necessary to get signaled back if y queue 
map/unmap operation has been successfully or if the CP not responding or 
any other error has happened etc...

Another issue is that while it is in general a good idea to have the 
firmware work in a way where errors are reported instead of completely 
stopping all processing, here we run into trouble because the driver 
usually assumes that work can be scheduled on the ring buffer and a 
subsequent work is processed only when everything previously submitted 
has completed successfully.

So as initial fix for the issue we see I've send Alex a patch on Friday 
to partially revert his change to use an individual writeback for each 
submission. Instead we will submit an addition QUERY_STATUS command 
after the real command and let that one write fence value. This way the 
fence value is always written, independent of the result of the operation.

Additional to that we need to insert something like a dependency between 
submissions, e.g. when you have commands A, B and C on the ring and C 
can only execute when A was successfully then we need to somehow tell 
that the MES. Only other alternative is to not scheduler commands behind 
each other on the ring and that in turn is a bad idea from the 
performance point of view.

Regards,
Christian.

Am 31.05.24 um 16:44 schrieb Liu, Shaoyun:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi, Christian
>
> I think we have a discussion about this before . Alex also have a change that allow driver to use different write back address for the fence for each submission for the  original issue .
>  From MES  point of view ,  MES will update the fence when the API can be complete successfully, so if the  API (ex . remove_queue) fails  due to  other component issue (ex , CP hang), the  MES will not update the fence In this situation , but  MES itself still works and can respond to other commands (ex ,,read_reg)  .  Alex's change allow driver to check the fence for each API without mess around them  .  If you expect MES to stop responding  to further commands  after one API fails , that will introduce combability issue since this design already exist on products for customer and MES also need to works for windows .  Also MES  always need to respond to  some commands like  RESET  etc  that might make things worse if we need to change the logic .
>
> One possible solution is MES can  trigger an Interrupt  to indicate which submission has failed with the seq number . In this case driver can get the  failure of the  submission to MES in time and  make its own decision for what to do next , What do you think about this ?
>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Wednesday, May 29, 2024 11:19 AM
> To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
>
> Am 29.05.24 um 16:48 schrieb Li, Yunxiang (Teddy):
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>> Yeah, I know. That's one of the reason I've pointed out on the patch
>>> adding that that this behavior is actually completely broken.
>>>
>>> If you run into issues with the MES because of this then please
>>> suggest a revert of that patch.
>> I think it just need to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong. Since MES will continue process submissions even when one submission failed. So with just one fence location there's no way to tell if a command failed or not.
> No the MES behavior is broken. When a submission failed it should stop processing or signal that the operation didn't completed through some other mechanism.
>
> Just not writing the fence and continuing results in tons of problems, from the TLB fence all the way to the ring buffer and reset handling.
>
> This is a hard requirement and really can't be changed.
>
> Regards,
> Christian.


  reply	other threads:[~2024-06-03 10:58 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-29  6:36   ` Christian König
2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-05-29  6:41   ` Christian König
2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
2024-05-29  6:38   ` Christian König
2024-05-29 13:22     ` Li, Yunxiang (Teddy)
2024-05-29 13:31       ` Christian König
2024-05-29 13:44         ` Li, Yunxiang (Teddy)
2024-05-29 13:55           ` Christian König
2024-05-29 14:31             ` Li, Yunxiang (Teddy)
2024-05-29 14:35               ` Christian König
2024-05-29 14:48                 ` Li, Yunxiang (Teddy)
2024-05-29 15:19                   ` Christian König
2024-05-31 14:44                     ` Liu, Shaoyun
2024-06-03 10:58                       ` Christian König [this message]
2024-06-03 18:28                         ` Liu, Shaoyun
2024-06-04  8:07                           ` Christian König
2024-06-05 12:32                             ` Liu, Shaoyun
2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-29  6:41   ` Christian König
2024-05-29 23:04   ` Felix Kuehling
2024-05-30  0:06     ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-29  6:43   ` Christian König
2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-29  6:45   ` Christian König
2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-29  6:46   ` Christian König
2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-29  6:49   ` Christian König
2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-29  6:55   ` Christian König
2024-05-30 22:02   ` Felix Kuehling
2024-05-30 22:35     ` Li, Yunxiang (Teddy)
2024-05-31  6:52     ` Christian König
2024-05-31 15:47       ` Felix Kuehling
2024-06-04 12:52         ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
2024-05-28 19:04   ` Skvortsov, Victor
2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-30 21:47   ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-30 21:47   ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-06-05  1:12     ` Deng, Emily
2024-05-30 21:48   ` [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-31  6:50     ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=380c4bd1-ab50-4f42-bb50-69dc0fbd28fe@amd.com \
    --to=christian.koenig@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Hua.Xiao@amd.com \
    --cc=Shaoyun.Liu@amd.com \
    --cc=Yunxiang.Li@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox