Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.

AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Christian König" <christian.koenig@amd.com>
To: "Li, Chong(Alan)" <Chong.Li@amd.com>
Cc: "cao, lin" <lin.cao@amd.com>,
	"Yin, ZhenGuo (Chris)" <ZhenGuo.Yin@amd.com>,
	"Zhang, Tiantian (Celine)" <Tiantian.Zhang@amd.com>,
	"Andjelkovic, Dejan" <Dejan.Andjelkovic@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode.
Date: Thu, 31 Oct 2024 11:03:49 +0100	[thread overview]
Message-ID: <f668d404-ab2a-43c4-9f16-571f733bb9a7@amd.com> (raw)
In-Reply-To: <DS7PR12MB5768E43AA139B90D4DE7C0019B552@DS7PR12MB5768.namprd12.prod.outlook.com>

[-- Attachment #1: Type: text/plain, Size: 12379 bytes --]

Hi Chong,

Am 31.10.24 um 10:54 schrieb Li, Chong(Alan):
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> Hi, Christian.
>
> Share the process of the page fault issue in rocblas benchmark.
>

finally some progress here. Thanks for the update.

> Find when there are multithreads read register “regIH_VMID_0_LUT” to 
> get pasid,
>
> This register will return error pasid value randomly, sometimes is 0, 
> sometimes is 32768, (the real value is 32770).
>
> After check the invalid pasid, code will “continue” and not flush the 
> gpu tlb.
>

That is really disturbing, concurrent register access is mandatory to 
work correctly.

Not only the TLB flush but many other operations depend on stuff like 
that as well.

> That’s why the page fault accours.
>
> After add the lock, the register not return invalid value, and the 
> rocblas benchmark passed.
>
> You have submit a patch "implement TLB flush fence", in this patch you 
> create a kernel thread to flush gpu tlb.
>
> And in main thread the function “svm_range_map_to_gpus” will call 
> function “kfd_flush_tlb” and then flush gpu tlb as well.
>
> Means that both the two threads will call function 
> “gmc_v11_0_flush_gpu_tlb_pasid”.
>
> So after you merge your patch, the page fault issue accours.
>
> My first patch change flush gpu tlb to sync mode,
>
> means the one thread flush the gpu tlb twice, so my first patch passed 
> the rocblas benchmark.
>

I will have to reject such patches, you need to find the underlying 
problem and not mitigate the symptoms.

> I already submit an email to firmware team to ask why the register 
> will return wrong value.
>
> But if the firmware team not able to solve this issue, or need a long 
> time to solve this issue,
>
> I will submit the patch like below to do the workaround.
>

Well that basically means a complete stop for any deliverable.

The driver stack simply won't work correctly when register reads return 
random values like that.

Regards,
Christian.

> Thanks,
>
> Chong.
>
> *From:*Li, Chong(Alan)
> *Sent:* Friday, October 25, 2024 2:46 PM
> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Andjelkovic, Dejan 
> <Dejan.Andjelkovic@amd.com>
> *Cc:* cao, lin <lin.cao@amd.com>; Yin, ZhenGuo (Chris) 
> <ZhenGuo.Yin@amd.com>; Zhang, Tiantian (Celine) 
> <Tiantian.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
> *Subject:* RE: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode 
> to sync mode.
>
> Hi, Christian.
>
> The size of log file so large, can’t paste in the Email.
>
> I copy the log file in directory “\\ark\incoming\chong\log 
> <file://ark/incoming/chong/log>”, the log file name is “kern.log”.
>
> Can you access this directory ?
>
> Thanks,
>
> Chong.
>
> *From:*Koenig, Christian <Christian.Koenig@amd.com 
> <mailto:Christian.Koenig@amd.com>>
> *Sent:* Thursday, October 24, 2024 7:22 PM
> *To:* Li, Chong(Alan) <Chong.Li@amd.com <mailto:Chong.Li@amd.com>>; 
> Andjelkovic, Dejan <Dejan.Andjelkovic@amd.com 
> <mailto:Dejan.Andjelkovic@amd.com>>
> *Cc:* cao, lin <lin.cao@amd.com <mailto:lin.cao@amd.com>>; Yin, 
> ZhenGuo (Chris) <ZhenGuo.Yin@amd.com <mailto:ZhenGuo.Yin@amd.com>>; 
> Zhang, Tiantian (Celine) <Tiantian.Zhang@amd.com 
> <mailto:Tiantian.Zhang@amd.com>>; Raina, Yera <Yera.Raina@amd.com 
> <mailto:Yera.Raina@amd.com>>
> *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode 
> to sync mode.
>
> Do you have the full log as text file? As image it's pretty much useless.
>
> Regards,
> Christian.
>
> Am 24.10.24 um 09:41 schrieb Li, Chong(Alan):
>
>     [AMD Official Use Only - AMD Internal Distribution Only]
>
>     Hi, Christian.
>
>     We can see the dmesg log,
>
>     After address “7ef90be00” already update the ptes, page fault
>     still happen.
>
>     Thanks,
>
>     Chong.
>
>     *From:*Koenig, Christian <Christian.Koenig@amd.com>
>     <mailto:Christian.Koenig@amd.com>
>     *Sent:* Wednesday, October 23, 2024 5:26 PM
>     *To:* Li, Chong(Alan) <Chong.Li@amd.com>
>     <mailto:Chong.Li@amd.com>; Andjelkovic, Dejan
>     <Dejan.Andjelkovic@amd.com> <mailto:Dejan.Andjelkovic@amd.com>
>     *Cc:* cao, lin <lin.cao@amd.com> <mailto:lin.cao@amd.com>; Yin,
>     ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>
>     <mailto:ZhenGuo.Yin@amd.com>; Zhang, Tiantian (Celine)
>     <Tiantian.Zhang@amd.com> <mailto:Tiantian.Zhang@amd.com>; Raina,
>     Yera <Yera.Raina@amd.com> <mailto:Yera.Raina@amd.com>
>     *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu tlb
>     mode to sync mode.
>
>     Hi Chong,
>
>     oh that could indeed be.
>
>     I suggest to add a trace point for the page fault so that we can
>     guarantee that we use the same time basis for both events.
>
>     That should make it trivial to compare them.
>
>     Regards,
>     Christian.
>
>     Am 23.10.24 um 10:17 schrieb Li, Chong(Alan):
>
>         [AMD Official Use Only - AMD Internal Distribution Only]
>
>         Hi, Christian.
>
>         *I add a log in kernel, and prove the timestamp in tracing log
>         is slower than dmesg log, *
>
>         *so we can’t give a conclusion that the issue in rocm.*
>
>         ------------------------ the information I sync with
>         Andjelkovic, Dejan ----------------------------------------
>
>         dmesg shows that the page fault happens address
>         “0x000072e5f4401000” at time “6587.772178”,
>
>         tracing log shows that the function “amdgpu_vm_update_ptes” be
>         called at time “6587.790869”,
>
>         ------------------------ the information I sync with
>         Andjelkovic, Dejan ----------------------------------------
>
>         From the log time stamp, you give a conclusion that “The test
>         tries to access memory before it is probably mapped and that
>         is provable by looking into the tracelogs.”.
>
>         But after I review the code, the function
>         “amdgpu_vm_ptes_update” be called in function
>         “svm_range_set_attr”,
>
>         So, after this log in above dmesg print “[ 6587.772136]
>         amdgpu: pasid 0x8002 svms 0x000000008b03ff39 [0x72e5f4400
>         0x72e5fc3ff] done, r=0”,
>
>         the function “svm_range_set_attr” will leave, in that time
>         “amdgpu_vm_ptes_update” is already be called, the timestamp is
>         not reasonable.
>
>         I think maybe the timestamp in tracing log has some delay, and
>         I add a line of log in kernel to verify my guess,
>
>         The below is the result:
>
>         tracing log shows the address “ffffffc00” at time “227.298607”,
>
>         dmesg log print the address “ffffffc00” at time “226.756137”.
>
>         traing log:
>
>         dmesg log:
>
>         Thanks,
>
>         Chong.
>
>         *From:*Li, Chong(Alan)
>         *Sent:* Monday, October 21, 2024 6:38 PM
>         *To:* Koenig, Christian <Christian.Koenig@amd.com>
>         <mailto:Christian.Koenig@amd.com>; Raina, Yera
>         <Yera.Raina@amd.com> <mailto:Yera.Raina@amd.com>; Andjelkovic,
>         Dejan <Dejan.Andjelkovic@amd.com>
>         <mailto:Dejan.Andjelkovic@amd.com>
>         *Cc:* cao, lin <lin.cao@amd.com> <mailto:lin.cao@amd.com>;
>         Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>
>         <mailto:ZhenGuo.Yin@amd.com>; Zhang, Tiantian (Celine)
>         <Tiantian.Zhang@amd.com> <mailto:Tiantian.Zhang@amd.com>
>         *Subject:* RE: [PATCH] drm/amd/amdgpu: change the flush gpu
>         tlb mode to sync mode.
>
>         Hi, Christian.
>
>         Thanks for your reply,
>
>         And do you have any advice about this issue?
>
>         Hi, Raina, Year.
>         Share I assign this ticket SWDEV-459983
>         <https://ontrack-internal.amd.com/browse/SWDEV-459983>to rocm
>         team?
>
>         Thanks,
>
>         Chong.
>
>         *From:*Koenig, Christian <Christian.Koenig@amd.com
>         <mailto:Christian.Koenig@amd.com>>
>         *Sent:* Monday, October 21, 2024 6:08 PM
>         *To:* Li, Chong(Alan) <Chong.Li@amd.com
>         <mailto:Chong.Li@amd.com>>; Raina, Yera <Yera.Raina@amd.com
>         <mailto:Yera.Raina@amd.com>>
>         *Cc:* cao, lin <lin.cao@amd.com <mailto:lin.cao@amd.com>>;
>         amd-gfx@lists.freedesktop.org
>         <mailto:amd-gfx@lists.freedesktop.org>
>         *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush gpu
>         tlb mode to sync mode.
>
>         Hi Chong,
>
>         Andjelkovic just shared a bunch of traces from rocm on teams
>         with me which I analyzed.
>
>         When you know what you look for it's actually pretty obvious
>         what's going on. Just look at the timestamp of the fault and
>         compare that with the timestamp of the operation mapping
>         something at the given address.
>
>         When mapping an address happens only after accessing an
>         address then there is clearly something wrong in the code
>         which coordinates this and that is the ROCm stress test tool
>         in this case.
>
>         Regards,
>         Christian.
>
>         Am 21.10.24 um 11:02 schrieb Li, Chong(Alan):
>
>             [AMD Official Use Only - AMD Internal Distribution Only]
>
>             Hi, Christian, Raina, Yera.
>
>             If this issue in rocm, I need assign my ticket
>             SWDEV-459983
>             <https://ontrack-internal.amd.com/browse/SWDEV-459983>to
>             rocm team.
>
>             Is there anything to share with the rocm pm?
>
>             Such as the Email or chat history or the ticket you talk
>             with Andjelkovic.
>
>             Thanks,
>
>             Chong.
>
>             *From:*Koenig, Christian <Christian.Koenig@amd.com>
>             <mailto:Christian.Koenig@amd.com>
>             *Sent:* Monday, October 21, 2024 4:00 PM
>             *To:* Li, Chong(Alan) <Chong.Li@amd.com>
>             <mailto:Chong.Li@amd.com>; amd-gfx@lists.freedesktop.org
>             <mailto:amd-gfx@lists.freedesktop.org>
>             *Cc:* cao, lin <lin.cao@amd.com> <mailto:lin.cao@amd.com>
>             *Subject:* Re: [PATCH] drm/amd/amdgpu: change the flush
>             gpu tlb mode to sync mode.
>
>             Am 21.10.24 um 07:56 schrieb Chong Li:
>
>
>                 change the gpu tlb flush mode to sync mode to
>
>                 solve the issue in the rocm stress test.
>
>
>             And again complete NAK to this.
>
>             I've already proven together with Andjelkovic that the
>             problem is that the rocm stress test is broken.
>
>             The test tries to access memory before it is probably
>             mapped and that is provable by looking into the tracelogs.
>
>             Regards,
>             Christian.
>
>
>
>                   
>
>                   
>
>                 Signed-off-by: Chong Li<chongli2@amd.com>  <mailto:chongli2@amd.com>
>
>                 ---
>
>                   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c | 4 ++--
>
>                   1 file changed, 2 insertions(+), 2 deletions(-)
>
>                   
>
>                 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
>
>                 index 51cddfa3f1e8..4d9ff7b31618 100644
>
>                 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
>
>                 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
>
>                 @@ -98,7 +98,6 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>
>                    f->adev = adev;
>
>                    f->dependency = *fence;
>
>                    f->pasid = vm->pasid;
>
>                 - INIT_WORK(&f->work, amdgpu_tlb_fence_work);
>
>                    spin_lock_init(&f->lock);
>
>                   
>
>                    dma_fence_init(&f->base, &amdgpu_tlb_fence_ops, &f->lock,
>
>                 @@ -106,7 +105,8 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>
>                   
>
>                    /* TODO: We probably need a separate wq here */
>
>                    dma_fence_get(&f->base);
>
>                 - schedule_work(&f->work);
>
>                   
>
>                    *fence = &f->base;
>
>                 +
>
>                 + amdgpu_tlb_fence_work(&f->work);
>
>                   }
>

[-- Attachment #2.1: Type: text/html, Size: 52917 bytes --]

[-- Attachment #2.2: image008.png --]
[-- Type: image/png, Size: 61864 bytes --]

[-- Attachment #2.3: image001.png --]
[-- Type: image/png, Size: 83231 bytes --]

[-- Attachment #2.4: image002.png --]
[-- Type: image/png, Size: 74007 bytes --]

[-- Attachment #2.5: image003.png --]
[-- Type: image/png, Size: 34357 bytes --]

[-- Attachment #2.6: image004.png --]
[-- Type: image/png, Size: 7556 bytes --]

[-- Attachment #2.7: image005.png --]
[-- Type: image/png, Size: 15396 bytes --]

[-- Attachment #2.8: image006.png --]
[-- Type: image/png, Size: 37011 bytes --]

[-- Attachment #2.9: image007.png --]
[-- Type: image/png, Size: 24298 bytes --]

next prev parent reply	other threads:[~2024-10-31 10:04 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-21  5:56 [PATCH] drm/amd/amdgpu: change the flush gpu tlb mode to sync mode Chong Li
2024-10-21  8:00 ` Christian König
2024-10-21  9:02   ` Li, Chong(Alan)
2024-10-21 10:08     ` Christian König
     [not found]       ` <DS7PR12MB57687EE5E003280FA4BBCDCD9B432@DS7PR12MB5768.namprd12.prod.outlook.com>
     [not found]         ` <DS7PR12MB57686BB80A2550A59BBEAE2A9B4D2@DS7PR12MB5768.namprd12.prod.outlook.com>
     [not found]           ` <ed87757e-2f1f-4084-a863-cea858648d31@amd.com>
     [not found]             ` <DS7PR12MB57680B09EDA58A48ECEAD0019B4E2@DS7PR12MB5768.namprd12.prod.outlook.com>
     [not found]               ` <24840999-9eb3-4e9c-a134-9eb78f13f8f8@amd.com>
2024-10-25  6:46                 ` Li, Chong(Alan)
2024-10-31  9:54                   ` Li, Chong(Alan)
2024-10-31 10:03                     ` Christian König [this message]
2024-11-04  6:43                       ` [PATCH] drm/amdgpu: fix return ramdom value when multiple threads read registers via mes Li, Chong(Alan)
2024-11-04 10:22                         ` Christian König
2024-11-04 10:54                           ` Li, Chong(Alan)
2024-11-04 13:15                             ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f668d404-ab2a-43c4-9f16-571f733bb9a7@amd.com \
    --to=christian.koenig@amd.com \
    --cc=Chong.Li@amd.com \
    --cc=Dejan.Andjelkovic@amd.com \
    --cc=Tiantian.Zhang@amd.com \
    --cc=ZhenGuo.Yin@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=lin.cao@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox