From: ebiederm@xmission.com (Eric W. Biederman)
To: Alex Deucher <alexdeucher@gmail.com>
Cc: kexec@lists.infradead.org,
amd-gfx list <amd-gfx@lists.freedesktop.org>,
Dave Young <dyoung@redhat.com>,
"Alexander E. Patrakov" <patrakov@gmail.com>,
Baoquan He <bhe@redhat.com>
Subject: Re: amdgpu problem after kexec
Date: Wed, 03 Feb 2021 18:54:41 -0600
Message-ID: <87wnvoodny.fsf@x220.int.ebiederm.org>
In-Reply-To: <CADnq5_MdLTLvVdwFQJxuRaQcQFNkLUNRt267zaxULNH0FUvFeA@mail.gmail.com> (Alex Deucher's message of "Wed, 3 Feb 2021 09:46:56 -0500")
Alex Deucher <alexdeucher@gmail.com> writes:
> On Wed, Feb 3, 2021 at 3:36 AM Dave Young <dyoung@redhat.com> wrote:
>>
>> Hi Baoquan,
>>
>> Thanks for ccing.
>> On 01/28/21 at 01:29pm, Baoquan He wrote:
>> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
>> > > Hello,
>> > >
>> > > I was trying out kexec on my new laptop, which is a HP EliteBook 735
>> > > G6. The problem is, amdgpu does not have hardware acceleration after
>> > > kexec. Also, strangely, the lines about BlueTooth are missing from
>> > > dmesg after kexec, but I have not tried to use BlueTooth on this
>> > > laptop yet. I don't know how to debug this, the relevant amdgpu lines
>> > > in dmesg are:
>> > >
>> > > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB
>> > > test failed on gfx (-110).
>> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
>> > >
>> > > The good and bad dmesg files are attached. Is it a kexec problem (and
>> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
>> > > need to provide some extra kernel arguments for debugging?
The best debugging step I can think of: can you arrange to have the amdgpu
modules removed before the final kexec -e?
That would tell us whether the code to shut down the GPU exists in the rmmod
path (the .remove method) and is simply missing from the kexec path (the
.shutdown method).
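A rough sketch of that experiment, written as a dry run that only prints
the commands in order. The kernel and initrd paths are placeholders, and
the "systemctl isolate multi-user.target" step is an assumption about how
to stop the display server so that the module can be unloaded; the unload
may still fail if something else holds the device.

```python
import shlex


def kexec_debug_plan(kernel, initrd, module="amdgpu"):
    """Return the command sequence for the rmmod-before-kexec experiment.

    The paths passed in are placeholders chosen by the caller; nothing is
    executed here, the commands are only assembled.
    """
    return [
        # Stage the new kernel, reusing the running kernel's command line.
        ["kexec", "-l", kernel, f"--initrd={initrd}", "--reuse-cmdline"],
        # Assumption: leaving the graphical target releases the GPU.
        ["systemctl", "isolate", "multi-user.target"],
        # Exercise the driver's .remove path before jumping.
        ["modprobe", "-r", module],
        # Jump into the staged kernel.
        ["kexec", "-e"],
    ]


if __name__ == "__main__":
    for cmd in kexec_debug_plan("/boot/vmlinuz", "/boot/initrd.img"):
        print(shlex.join(cmd))
```

If acceleration works in the second kernel after this sequence but not
after a plain "kexec -e", that would point at teardown work that is done
in .remove and missing from .shutdown.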
>> > I am not familiar with graphical component. Add Dave to CC to see if
>> > he has some comments. It would be great if amdgpu expert can have a look.
>>
>> It needs amdgpu driver people to help. Since kexec bypass
>> bios/UEFI initialization so we requires drivers to implement .shutdown
>> method and test it to make 2nd kernel to work correctly.
>
> kexec is tricky to make work properly on our GPUs. The problem is
> that there are some engines on the GPU that cannot be re-initialized
> once they have been initialized without an intervening device reset.
> APUs are even trickier because they share a lot of hardware state with
> the CPU. Doing lots of extra resets adds latency. The driver has
> code to try and detect if certain engines are running at driver load
> time and do a reset before initialization to make this work, but it
> apparently is not working properly on your system.
There are two cases that I think sometimes get mixed up.
There is kexec-on-panic, in which case all of the work needs to happen in
the driver initialization.
There is also a simple kexec, in which case some of the work can happen
in the kernel that is being shut down, and sometimes that is easier.
Does it make sense to reset your device unconditionally on driver removal?
Would it make sense to reset your device unconditionally on driver add?
How can someone debug the smart logic of reset on driver load?
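To make the two cases concrete, here is a toy model, not amdgpu code. It
assumes a single engine that cannot be initialized twice without a reset,
as described in the quoted text above. All names (Gpu, probe,
boot_then_kexec) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Toy device with one engine that cannot be re-initialized while running."""
    engine_running: bool = False

    def reset(self) -> None:
        self.engine_running = False

    def init_engine(self) -> bool:
        # Initializing an already-running engine fails; this stands in for
        # the IB ring test timing out in the second kernel.
        if self.engine_running:
            return False
        self.engine_running = True
        return True


def probe(gpu: Gpu, reset_on_load: bool) -> bool:
    """Driver load: optionally reset first, then initialize the engine."""
    if reset_on_load:
        gpu.reset()
    return gpu.init_engine()


def boot_then_kexec(panic: bool, reset_on_shutdown: bool, reset_on_load: bool) -> bool:
    """Return True if the second kernel's driver load succeeds."""
    gpu = Gpu()
    # First kernel: the device is fresh after firmware initialization.
    assert probe(gpu, reset_on_load)
    # The old kernel's shutdown path only runs for a simple kexec.
    if not panic and reset_on_shutdown:
        gpu.reset()
    # Second kernel: no firmware initialization happened in between.
    return probe(gpu, reset_on_load)
```

In this model a reset in the shutdown path is enough for a simple kexec
(boot_then_kexec(False, True, False) succeeds) but cannot help
kexec-on-panic (boot_then_kexec(True, True, False) fails), where only a
reset at load time works (boot_then_kexec(True, False, True) succeeds).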
Eric