intel-gfx.lists.freedesktop.org archive mirror
* Re: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n
       [not found] <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com>
@ 2025-07-04  8:29 ` Jani Nikula
  2025-07-15  6:37   ` Baoquan He
       [not found]   ` <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com>
  0 siblings, 2 replies; 4+ messages in thread
From: Jani Nikula @ 2025-07-04  8:29 UTC
  To: Askar Safin, regressions, intel-gfx, kexec, dri-devel, iommu,
	Ben Hutchings
  Cc: anushasrivatsa, joonaslahtinen, tvrtkoursulin, josesouza,
	davehansen

On Thu, 03 Jul 2025, Askar Safin <safinaskar@zohomail.com> wrote:
> TL;DR: I found a bug in a strange interaction between kexec_file_load (but not kexec_load) and i915
> TL;DR#2: Second (sometimes third or fourth) kexec (using kexec_file_load) fails on my particular hardware
> TL;DR#3: I did 55 experiments, each of them required a lot of boots, in total I did 1908 boots

Thanks for the detailed debug info. I'm afraid all I can say at this
point is, please file all of this in a bug report as described in
[1]. Please add the drm.debug related options, and attach the dmesgs and
configs in the bug instead of pointing at external sites.

BR,
Jani.


[1] https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html


>
> Okay, so I found a bug. Steps to reproduce:
> - I have Dell Precision 7780
> - I have recent Debian x86_64 sid installed (bug reproducible with both Debian kernels and mainline ones)
> - Bug is reproducible on many kernels, including very recent ones, for example 6.15.4
> - Boot system, then do kexec into the same system using kexec_file_load. I. e. pass --kexec-file-syscall to "kexec" command
> - Then kexec from this kexec'ed system again (i. e. you should do two kexec's in a row)
> - Then do 3rd kexec, etc
> - Repeat kexec's until you do 100 kexec's or your system starts to misbehave
>
> On my computer the system starts to misbehave after some number of kexec's. This never happens before the 2nd kexec attempt,
> i. e. the first kexec is always successful, but the second sometimes is not.
> I was never able to perform 100 kexec's in a row.
> After some kexec attempt the system starts to misbehave: oopses, panics, locked system, etc.
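
The reproduction steps above can be sketched as a small helper that builds the two kexec invocations for one hop. This is only a sketch under assumptions: the kernel and initrd paths are hypothetical placeholders, `--reuse-cmdline` is an assumed way of carrying the command line over, and the reporter's actual script is not shown in this thread. The only thing taken from the report is the choice between `--kexec-file-syscall` and the `kexec_load` path.

```python
import shlex
import subprocess


def kexec_commands(kernel, initrd, use_file_syscall=True):
    """Return (load, execute) argv lists for one kexec hop."""
    # --kexec-file-syscall selects kexec_file_load; --kexec-syscall selects kexec_load.
    syscall_flag = "--kexec-file-syscall" if use_file_syscall else "--kexec-syscall"
    load = ["kexec", syscall_flag, "--load", kernel,
            f"--initrd={initrd}", "--reuse-cmdline"]
    # "kexec --exec" jumps straight into the loaded kernel without a clean shutdown.
    execute = ["kexec", "--exec"]
    return load, execute


def run(commands, dry_run=True):
    """Print the commands, or really execute them when dry_run is False (needs root)."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


# Hypothetical paths; adjust to the kernel under test.
load_cmd, exec_cmd = kexec_commands("/boot/vmlinuz", "/boot/initrd.img")
run([load_cmd, exec_cmd])  # dry run: only prints the two commands
```

Passing `use_file_syscall=False` gives the `kexec_load` variant, which is the configuration reported as not reproducing the bug.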
>
> Notes:
>
> - I tried to bisect the "kexec-tools" package, but the bisect merely gave me the commit that switched to kexec_file_load as the default.
> The bug is reproducible if we use kexec_file_load, but doesn't reproduce if we use kexec_load
>
> - Bug is reproducible even if we boot via init=/bin/bash (note: this means that initramfs is still part of the boot process). (If we boot to normal GUI, bug is reproducible, too)
>
> - When I reproduce I use this command line: "root=UUID=... rootflags=subvol=... ro init=..."
>
> - Debian package "plymouth" is required for reproducing. (It reproduces with plymouth, but doesn't reproduce without plymouth.) But note that I never see the actual plymouth screen! I. e. the presence of "plymouth" on the system somehow affects bug reproducibility despite the plymouth animation never actually being shown. I don't know why this happens. I suspect that I don't see the plymouth screen because I don't pass "splash" on the kernel command line, but that plymouth is still included in the initramfs and from there somehow affects the boot process
>
> - Bug reproduces in Debian, but doesn't reproduce in Ubuntu. After a lot of experimenting I finally understood why: the Ubuntu kernel has CONFIG_INTEL_IOMMU_DEFAULT_ON=y, and the Debian kernel does not. Additional experiments confirmed that this is the culprit, i. e. the bug is reproducible with CONFIG_INTEL_IOMMU_DEFAULT_ON=n and not reproducible with CONFIG_INTEL_IOMMU_DEFAULT_ON=y. (So advice for distributions: do what Ubuntu does, i. e. set CONFIG_INTEL_IOMMU_DEFAULT_ON=y to hide this bug)
>
> - Bug is not reproducible in old enough kernels, so I did bisect on Linux. Bisect showed me these commits: d4a2393049..4a75f32fc7. I. e. bug is reproducible in 4a75f32fc7, but doesn't reproduce in d4a2393049. Between them there is a middle commit 52407c220c44c8dcc6a, which is not testable. Here are these commits:
>
> commit 4a75f32fc783128d0c42ef73fa62a20379a66828
> Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
>
>    drm/i915/rpl-s: Add PCH Support for Raptor Lake S
>
> commit 52407c220c44c8dcc6aa8aa35ffc8a2db3c849a9
> Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
>
>    drm/i915/rpl-s: Add PCI IDS for Raptor Lake S
>
> It seems these commits merely added support for my Intel GPU model. So this is a fake regression. I'm not sure whether this should be treated as a proper regression and whether regzbot should be notified. (What do you think?)
>
> Still, formally this is a regression: my experiments show that the bug is present in 4a75f32fc783128d0c42 and not present before. (Side note: in latest kernels both wayland and x11 work, in d4a2393049 x11 works and wayland doesn't.)
>
> I tried to reproduce the bug in Qemu, but I was unable to do so. It seems Intel GPU is required, maybe even my particular model.
>
> Here is "lspci -vnn -d :*:0300" for my GPU:
>
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S UHD Graphics [8086:a788] (rev 04) (prog-if 00 [VGA controller])
>         Subsystem: Dell Raptor Lake-S UHD Graphics [1028:0c42]
>         Flags: bus master, fast devsel, latency 0, IRQ 202, IOMMU group 0
>         Memory at 604b000000 (64-bit, non-prefetchable) [size=16M]
>         Memory at 4000000000 (64-bit, prefetchable) [size=256M]
>         I/O ports at 3000 [size=64]
>         Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
>         Capabilities: [40] Vendor Specific Information: Len=0c <?>
>         Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
>         Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
>         Capabilities: [d0] Power Management version 2
>         Capabilities: [100] Process Address Space ID (PASID)
>         Capabilities: [200] Address Translation Service (ATS)
>         Capabilities: [300] Page Request Interface (PRI)
>         Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
>         Kernel driver in use: i915
>         Kernel modules: i915
>
> dmidecode:
> https://zerobin.net/?aebea072b93d8122#z4W9URnV+k9ZZErhP4etQkxlfpyRKf++uKMNoO5PGjs=
>
> - I use "root=UUID=... rootflags=subvol=... ro init=..." as a command line for reproducing. If I add "recovery nomodeset dis_ucode_ldr" (these are the options used by Ubuntu in recovery mode), the bug stops reproducing
>
> Again, in short, full list of things required for successful reproducing:
> - Intel GPU, possibly my particular model
> - Kernel with support for my model (4a75f32fc783128d0c42 and later up to 6.15.4)
> - Kexec at least two times. (One kexec never fails, 100 kexec's in a row never succeed)
> - kexec_file_load as opposed to kexec_load
> - Initramfs
> - Lack of parameters "recovery nomodeset dis_ucode_ldr" (i. e. one of them stops reproducing)
> - plymouth
> - CONFIG_INTEL_IOMMU_DEFAULT_ON=n
>
> Removing ANY of them stops the bug, and I proved this by lots of experiments.
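
Of the conditions listed above, the kernel config option is the easiest to check mechanically. The following is a minimal sketch, assuming the usual kernel `.config` text format (`OPTION=value` lines and `# OPTION is not set` comments). The sample strings are made up for illustration, and the file location mentioned in the note after the block is an assumption about a typical Debian or Ubuntu install.

```python
import re


def kconfig_value(config_text, option):
    """Return the value of a Kconfig option, or 'n' if it is unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options are recorded as a comment line.
        if line == f"# {option} is not set":
            return "n"
        match = re.fullmatch(rf"{re.escape(option)}=(.*)", line)
        if match:
            return match.group(1)
    return "n"  # option not mentioned at all: treat as disabled


# Made-up config fragments, one per behaviour described in the report.
debian_like = "CONFIG_INTEL_IOMMU=y\n# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set\n"
ubuntu_like = "CONFIG_INTEL_IOMMU=y\nCONFIG_INTEL_IOMMU_DEFAULT_ON=y\n"

option = "CONFIG_INTEL_IOMMU_DEFAULT_ON"
print(kconfig_value(debian_like, option))  # configuration reported as affected
print(kconfig_value(ubuntu_like, option))  # configuration reported as not affected
```

On a running system the text would typically come from a file such as `/boot/config-<kernel release>`, if the distribution installs one.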
>
> In total I did 55+ experiments, each of them required up to 100 boots. In total I did 1908 (!!!!!!) boots on my physical laptop (I mean kexec boots here). No, I'm not faking this number, here are my actual directories with results:
>
> user@subvolume:~$ ls /rbt/kx-results/
> @rec-2025-06-29T201723Z-bad-4    @rec-2025-06-29T214650Z-good-60  @rec-2025-07-03T050626Z-bad-41    @rec-2025-07-03T104125Z-bad-28    @rec-2025-07-03T133705Z-bad-3
> @rec-2025-06-29T203429Z-good-60  @rec-2025-06-29T215558Z-bad-8    @rec-2025-07-03T060107Z-good-100  @rec-2025-07-03T111727Z-bad-13    @rec-2025-07-03T141647Z-good-100
> @rec-2025-06-29T205626Z-good-60  @rec-2025-07-01T042949Z-bad-12   @rec-2025-07-03T074810Z-good-100  @rec-2025-07-03T122242Z-good-100  @rec-2025-07-03T145705Z-good-100
> @rec-2025-06-29T211612Z-bad-6    @rec-2025-07-02T120101Z-good-60  @rec-2025-07-03T082914Z-good-100  @rec-2025-07-03T123958Z-bad-12    @rec-2025-07-03T152406Z-bad-50
> @rec-2025-06-29T212932Z-good-60  @rec-2025-07-03T031038Z-good-60  @rec-2025-07-03T100615Z-good-100  @rec-2025-07-03T132116Z-good-100  @rec-2025-07-03T154204Z-bad-15
> user@subvolume:~$ ls /rbt/kx-manual-testing/
> 2025-07-01-03-19-good-6  2025-07-01-03-56-good-4  2025-07-01-05-28-bad-3  2025-07-01-06-35-bad-2  2025-07-01-09-46-good-8
> 2025-07-01-03-44-good-3  2025-07-01-04-47-good-3  2025-07-01-06-19-bad-2  2025-07-01-09-21-bad-2  2025-07-02-13-09-good
> user@subvolume:~$ ls /rbt/kx-vanilla-results/
> 2025-06-30T005219Z_5.16.0-kx-df0cc57e057f18e4-3e17eec5ff024b63_1626_good_60      2025-06-30T023542Z_5.16.0-rc2-kx-87bb2a410dcfb617-9f30253daecd39e5_1663_bad_4
> 2025-06-30T012313Z_5.17.0-kx-f443e374ae131c16-91b07dce12a83fab_1674_bad_1        2025-06-30T032312Z_5.16.0-rc2-kx-c9ee950a2ca55ea0-854a1f40ce042801_1662_bad_6
> 2025-06-30T013555Z_5.16.0-kx-22ef12195e13c5ec-9aaf880b25942f2a_1668_bad_7        2025-06-30T033528Z_5.16.0-rc2-kx-ba884a411700dc56-854a1f40ce042801_1662_good_60
> 2025-06-30T014106Z_5.16.0-kx-9bcbf894b6872216-b828905f3cf12050_1664_bad_2        2025-06-30T034645Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60
> 2025-06-30T014634Z_5.16.0-rc5-kx-cb6846fbb83b574c-83e7c6cf2ede57b4_1663_bad_6    2025-06-30T035232Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_5
> 2025-06-30T015713Z_5.16.0-rc2-kx-15bb79910fe734ad-9f30253daecd39e5_1663_good_60  2025-06-30T042058Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1
> 2025-06-30T020235Z_5.16.0-rc5-kx-b06103b5325364e0-26176b9b704a5c24_1664_bad_6    2025-06-30T050000Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_bad_2
> 2025-06-30T020717Z_5.16.0-rc5-kx-eacef9fd61dcf5ea-26176b9b704a5c24_1664_bad_1    2025-06-30T053011Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_good_60
> 2025-06-30T021738Z_5.16.0-rc2-kx-67b858dd89932086-8d2f1d17f1e1933c_1662_good_60  2025-06-30T060619Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60
> 2025-06-30T022759Z_5.16.0-rc2-kx-17815f624a90579a-854a1f40ce042801_1662_good_60  2025-06-30T061448Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1
>
> Each number at the end of a file/directory name is the number of boots. In total we have 1908 boots. Testing was mostly automatic, using my script.
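
The naming scheme described above (a good/bad verdict followed by a boot count at the end of each name) can be tallied with a short script. This is a sketch, not the reporter's tooling: the regular expression is an assumption inferred from the listed names, and the sample is only a handful of entries copied from the listings, so it does not reproduce the full total.

```python
import re

# Verdict, optionally followed by "-N" or "_N", at the very end of the name.
RESULT_RE = re.compile(r"(good|bad)(?:[-_](\d+))?$")


def tally(names):
    """Sum boot counts per verdict; collect names that carry no count."""
    totals = {"good": 0, "bad": 0}
    unnumbered = []
    for name in names:
        match = RESULT_RE.search(name)
        if match is None:
            raise ValueError(f"unrecognised result name: {name}")
        verdict, count = match.groups()
        if count is None:
            unnumbered.append(name)
        else:
            totals[verdict] += int(count)
    return totals, unnumbered


# A few names taken from the three directories listed above.
sample = [
    "@rec-2025-06-29T201723Z-bad-4",
    "@rec-2025-06-29T203429Z-good-60",
    "2025-07-01-03-19-good-6",
    "2025-07-02-13-09-good",
    "2025-06-30T005219Z_5.16.0-kx-df0cc57e057f18e4-3e17eec5ff024b63_1626_good_60",
]
totals, unnumbered = tally(sample)
print(totals)      # {'good': 126, 'bad': 4}
print(unnumbered)  # ['2025-07-02-13-09-good']
```

Names without a trailing count are reported separately instead of being guessed at.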
>
> Here is one example dmesg from mainline commit e60eb441596d1c70 (somewhere around 6.15.4):
>
> https://zerobin.net/?119ff118fd47b363#BpziYs6dNz5PaT7H8w2hlveoEYa4DDtITGkyd9o57LE=
>
> This was the dmesg from the 2nd (and at the same time last) boot. The next boot (i. e. kexec) was unsuccessful. Corresponding config:
>
> https://zerobin.net/?009c807e1df41af8#gnmrswlbaFbdPTuzNq6NFkQd/Jhb3Ds0ZlLiwNanXnc=
>
> If you want the results from all experiments, here is a link: https://filebin.net/45g2757b2iwaeen7 (1 Mb, expires after 7 days). Usually experiments come with a full reproducer script.
>
> But what I described above is already enough, I think this link is not needed.
>
> I will be available for testing in the coming days; then I will switch to other things and will no longer be available for testing.
> If you want more time, then please ask for it, i. e. tell me something like "Please be available for testing for 10 more days".
>
> --
> Askar Safin
> https://types.pl/@safinaskar
>

-- 
Jani Nikula, Intel


* Re: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n
  2025-07-04  8:29 ` Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n Jani Nikula
@ 2025-07-15  6:37   ` Baoquan He
       [not found]   ` <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com>
  1 sibling, 0 replies; 4+ messages in thread
From: Baoquan He @ 2025-07-15  6:37 UTC
  To: Askar Safin, Jani Nikula
  Cc: regressions, intel-gfx, kexec, dri-devel, iommu, Ben Hutchings,
	anushasrivatsa, joonaslahtinen, tvrtkoursulin, josesouza,
	davehansen

On 07/04/25 at 11:29am, Jani Nikula wrote:
> On Thu, 03 Jul 2025, Askar Safin <safinaskar@zohomail.com> wrote:
> > TL;DR: I found a bug in a strange interaction between kexec_file_load (but not kexec_load) and i915
> > TL;DR#2: Second (sometimes third or fourth) kexec (using kexec_file_load) fails on my particular hardware
> > TL;DR#3: I did 55 experiments, each of them required a lot of boots, in total I did 1908 boots
> 
> Thanks for the detailed debug info. I'm afraid all I can say at this
> point is, please file all of this in a bug report as described in
> [1]. Please add the drm.debug related options, and attach the dmesgs and
> configs in the bug instead of pointing at external sites.

Yeah, this is a great example that people can refer to when reporting
issues upstream. Thanks for the details.

For the bug itself, I hope the Intel GPU people can have a look, see
what happened and how to fix it. For kexec reboot, we have had problems
with Nvidia GPUs and amdgpu that make it hard to kexec repeatedly into a
2nd kernel. At Red Hat we ran into this several years ago and tried to
contact the GPU developers, but there was no way to fix it. Finally
we had to formally declare kexec reboot unsupported. This Intel GPU
issue could be a different one; I still hope the GPU developers can have a look.

Currently, many people are investing a lot of effort upstream in KHO,
K-state, etc. to make kexec reboot versatile and flexible. I am very glad
to see that. I guess people have possibly met the same GPU issues on
Nvidia and AMD GPUs that I mentioned and are trying to solve them. Otherwise,
no matter how wonderful KHO, K-state or K-anything are, they are just a
skyscraper on sand.

Personal opinion.

Thanks
Baoquan



* Re: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n
       [not found]   ` <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com>
@ 2025-07-21 14:18     ` Pingfan Liu
  2025-07-22  1:28       ` Askar Safin
  0 siblings, 1 reply; 4+ messages in thread
From: Pingfan Liu @ 2025-07-21 14:18 UTC
  To: Askar Safin
  Cc: Jani Nikula, regressions, intel-gfx, kexec, dri-devel, iommu,
	Ben Hutchings, joonaslahtinen, josesouza, davehansen

On Sat, Jul 5, 2025 at 4:12 AM Askar Safin <safinaskar@zohomail.com> wrote:
>
>  ---- On Fri, 04 Jul 2025 12:29:01 +0400  Jani Nikula <jani.nikula@linux.intel.com> wrote ---
>  > Thanks for the detailed debug info. I'm afraid all I can say at this
>  > point is, please file all of this in a bug report as described in
>  > [1]. Please add the drm.debug related options, and attach the dmesgs and
>  > configs in the bug instead of pointing at external sites.
>
> Okay, now let me speculate how to fix this bug. :) I think someone with moderate kexec understanding
> and with an Intel GPU should do this: reproduce the bug and then slowly modify the kexec_file_load code until it
> becomes the kexec_load code. (Or vice versa.) Somewhere in the middle of this modification the bug will stop reproducing,
> and so we will know what exactly causes it.
>
> kexec_file_load and kexec_load should behave the same. If they do not, then we should
> understand why. We should closely review their code.
>
> Also, in the case of kexec_load, kernel uncompressing and parsing are performed by the "kexec" userspace
> tool, and in the case of kexec_file_load by the kernel. So we should closely review these two uncompressing/parsing code fragments.
> I think that this bug is related to kexec, not to i915. And thus it should be fixed by kexec people, not by i915 people. (But I may be wrong.)
>

I tend to agree with Baoquan about kexec reboot with a graphics card.
I have heard that this is due to the firmware's initialization of the
graphics card being skipped in the kexec reboot process. But that is not
an official explanation. If any experts could enlighten me on this, I'd
really appreciate it.

IMHO, you could try blacklisting the i915 module to see if
kexec_file_load works without issues - this would help narrow down the
culprit.
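
One possible way to run that experiment is to keep the module from loading via the kernel command line of the kexec'ed kernel. The sketch below only edits a command-line string. It assumes the documented `module_blacklist=` kernel parameter (a comma-separated list of modules not to load); the base command line is a hypothetical placeholder, not the one used in this thread.

```python
def add_module_blacklist(cmdline, module):
    """Return cmdline with `module` added to module_blacklist= (idempotent)."""
    prefix = "module_blacklist="
    tokens = cmdline.split()
    for i, token in enumerate(tokens):
        if token.startswith(prefix):
            # Merge into an existing comma-separated list.
            modules = [m for m in token[len(prefix):].split(",") if m]
            if module not in modules:
                modules.append(module)
            tokens[i] = prefix + ",".join(modules)
            return " ".join(tokens)
    tokens.append(prefix + module)
    return " ".join(tokens)


# Hypothetical base command line.
base = "root=/dev/sda1 ro"
print(add_module_blacklist(base, "i915"))
# root=/dev/sda1 ro module_blacklist=i915
```

The resulting string could then be handed to the next kernel, for example through kexec's `--command-line=` option.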

Thanks,

Pingfan

> But okay, I reported it to that bug tracker anyway: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14598
>
> Maybe there is separate kexec bug tracker?
>
> Also, your bug tracker is cool. One can attach files in the middle of a report. Why doesn't the whole kernel use it? :)
>
> --
> Askar Safin
> https://types.pl/@safinaskar
>
>



* Re: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n
  2025-07-21 14:18     ` Pingfan Liu
@ 2025-07-22  1:28       ` Askar Safin
  0 siblings, 0 replies; 4+ messages in thread
From: Askar Safin @ 2025-07-22  1:28 UTC
  To: Pingfan Liu
  Cc: Jani Nikula, regressions, intel-gfx, kexec, dri-devel, iommu,
	Ben Hutchings, joonaslahtinen, josesouza, davehansen

 ---- On Mon, 21 Jul 2025 18:18:48 +0400  Pingfan Liu <piliu@redhat.com> wrote --- 
 > IMHO, you could try blacklisting the i915 module to see if
I did this. The problem is in i915.
Here you can see our discussion with i915 devs: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14598
--
Askar Safin
https://types.pl/@safinaskar

