From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: "Jürgen Groß" <jgross@suse.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>,
    Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: Linux 6.13-rc3 many different panics in Xen PV dom0
Date: Fri, 3 Jan 2025 01:18:31 +0100
Message-ID: <Z3cs1-wG5WJ9FrAR@mail-itl>
In-Reply-To: <Z3brZQmYhx-QTnga@mail-itl>
On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > It crashes on boot like below most of the time, but sometimes (rarely)
> > > > > > > it manages to stay alive. Below I'm pasting a few of the crashes that look
> > > > > > > distinctly different; if you follow the links, you can find more of
> > > > > > > them. IMHO it looks like a memory corruption bug somewhere. I also tested
> > > > > > > Linux 6.13-rc2 before, and it had a very similar issue.
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > >
> > > > > > > Full log:
> > > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > >
> > > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > >
> > > > > > What is really interesting in the logs: most crashes seem to happen right
> > > > > > after a module has been loaded (in my reproducer it was right after loading
> > > > > > the first module).
> > > > > >
> > > > > > I need to go through the 6.13 commits, but I think I remember having seen
> > > > > > a patch optimizing module loading by using large pages for addressing the
> > > > > > loaded modules. Maybe the case of no large pages being available isn't
> > > > > > handled properly.
> > > > >
> > > > > Seems I was right.
> > > > >
> > > > > For me the following diff fixes the issue. Marek, can you please confirm
> > > > > it fixes your crashes, too?
> > > >
> > > > Thanks for looking into it!
> > > > Will do, I've pushed it to
> > > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> > > > and then I'll post it to openQA.
> > >
> > > It is much better!
> > >
> > > Tests are still running, but I already see that many are green.
> >
> > So are you fine with me adding your "Tested-by:"?
>
> Yes.
>
> > > There is
> > > one issue (likely unrelated to this change): sys-usb (an HVM domU with USB
> > > controllers passed through) crashes on a system with a Raptor Lake CPU
> > > (only there; others, including ADL and MTL, look fine):
Correction: it does happen on some others too. I just got the crash on the ADL
system, although it looks a bit different ("Corrupted page table at ..."):
sys-usb login: [2025-01-02 23:44:58] [ 7.295556] Bluetooth: hci0: Waiting for firmware download to complete
[ 7.296996] Bluetooth: hci0: Firmware loaded in 2882606 usecs
[ 7.297276] Bluetooth: hci0: Waiting for device to boot
[ 7.313074] Bluetooth: hci0: Device booted in 15473 usecs
[ 7.318447] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-1040-0041.ddc
[ 7.321060] Bluetooth: hci0: Applying Intel DDC parameters completed
[ 7.322057] Bluetooth: hci0: No support for BT device in ACPI firmware
[ 7.324037] Bluetooth: hci0: Firmware timestamp 2024.33 buildtype 1 build 81755
[ 7.324085] Bluetooth: hci0: Firmware SHA1: 0xd028ffe4
[ 7.327995] Bluetooth: hci0: Fseq status: Success (0x00)
[ 7.328017] Bluetooth: hci0: Fseq executed: 00.00.02.41
[ 7.328032] Bluetooth: hci0: Fseq BT Top: 00.00.02.41
[ 7.396950] Bluetooth: MGMT ver 1.23
[ 9.352650] kauditd_printk_skb: 82 callbacks suppressed
[ 9.352655] audit: type=1131 audit(1735861500.506:81): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 15.808157] audit: type=1100 audit(1735861506.961:82): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[ 15.808860] audit: type=1100 audit(1735861506.962:83): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[ 15.814137] audit: type=1103 audit(1735861506.967:84): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[ 15.814816] audit: type=1006 audit(1735861506.968:85): pid=867 uid=0 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=tty7 old-ses=4294967295 ses=1 res=1
[ 15.815078] audit: type=1300 audit(1735861506.968:85): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe29c03a70 a2=4 a3=0 items=0 ppid=712 pid=867 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=tty7 ses=1 comm="qubes-gui-runus" exe="/usr/bin/qubes-gui-runuser" subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 key=(null)
[ 15.815164] audit: type=1327 audit(1735861506.968:85): proctitle=2F7573722F62696E2F71756265732D6775692D72756E757365720075736572002F62696E2F7368002D6C002D630065786563202F7573722F62696E2F78696E6974202F6574632F5831312F78696E69742F78696E69747263202D2D202F7573722F6C69622F71756265732F71756265732D786F72672D77726170706572203A30
[ 15.815420] audit: type=1103 audit(1735861506.969:86): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[ 15.816039] audit: type=1006 audit(1735861506.969:87): pid=866 uid=0 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=2 res=1
[ 15.817029] audit: type=1300 audit(1735861506.969:87): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe550c1c30 a2=4 a3=0 items=0 ppid=864 pid=866 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=2 comm="qrexec-agent" exe="/usr/lib/qubes/qrexec-agent" subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 key=(null)
[ 15.817160] audit: type=1327 audit(1735861506.969:87): proctitle="/usr/lib/qubes/qrexec-agent"
[ 16.111133] systemd-journald[366]: Time jumped backwards, rotating.
th: RFCOMM TTY layer initialized
[ 18.286026] Bluetooth: RFCOMM socket layer initialized
[ 18.286035] Bluetooth: RFCOMM ver 1.11
[ 18.469074] abrt-dump-journ: Corrupted page table at address 78c64b600010
[ 18.469096] PGD 14980067 P4D 14980067 PUD 14981067 PMD 38c8047 PTE 243c8b48ffffff57
[ 18.469117] Oops: Bad pagetable: 000d [#1] PREEMPT SMP NOPTI
[ 18.469132] CPU: 1 UID: 0 PID: 657 Comm: abrt-dump-journ Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
[ 18.469152] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
[ 18.469165] RIP: 0033:0x78c64e1bc9a0
[ 18.469177] Code: 86 f5 01 00 00 49 8b 7c 24 38 48 85 ff 0f 84 08 03 00 00 48 8d 0d 40 e6 ff ff ba 18 00 00 00 e8 46 c7 fa ff e9 d1 01 00 00 90 <0f> b6 50 10 38 96 c8 01 00 00 0f 85 63 fd ff ff 80 fa 02 0f 84 4c
[ 18.469211] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[ 18.469223] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[ 18.469238] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[ 18.469253] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[ 18.469268] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[ 18.469284] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[ 18.469299] FS: 000078c64d675400 GS: 0000000000000000
[ 18.469310] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore rfcomm bnep btusb btrtl btintel btbcm btmtk bluetooth rfkill nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_ct nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 joydev nf_tables intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 xhci_pci ehci_pci xhci_hcd ehci_hcd pcspkr i2c_piix4 i2c_smbus ata_generic pata_acpi serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
[ 18.469484] ---[ end trace 0000000000000000 ]---
[ 18.469495] RIP: 0033:0x78c64e1bc9a0
[ 18.469504] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[ 18.469516] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[ 18.469531] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[ 18.469547] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[ 18.469562] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[ 18.469577] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[ 18.469593] FS: 000078c64d675400(0000) GS:ffff9de397100000(0000) knlGS:0000000000000000
[ 18.469609] CS: 0033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.469623] CR2: 000078c64b600010 CR3: 0000000000164004 CR4: 0000000000770ef0
[ 18.469640] PKRU: 55555554
[ 18.469646] Kernel panic - not syncing: Fatal exception
[ 18.469706] Kernel Offset: 0x2ec00000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
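For reference, a small decoding aid (a minimal sketch, not kernel code): "Bad pagetable: 000d" above is the x86 page-fault error code. Using the bit layout from the Intel SDM, 0x000d is a protection fault (PROT) on a user-mode (USER) read with the reserved-bit flag (RSVD) set. That means the CPU found a reserved bit set in a paging-structure entry, which is the case Linux reports as "Corrupted page table". The flag names below mirror Linux's X86_PF_* names but are defined locally. The MAXPHYADDR of 46 is an assumption for illustration only; the guest's real value comes from CPUID leaf 0x80000008.

```python
from __future__ import annotations

# Page-fault error code bits (Intel SDM, "Interrupt 14 -- Page-Fault
# Exception"). Names mirror Linux's X86_PF_* but are defined locally here.
PF_BITS = {
    0: "PROT",   # 1 = protection violation, 0 = page not present
    1: "WRITE",  # 1 = write access, 0 = read access
    2: "USER",   # 1 = fault happened in user mode
    3: "RSVD",   # 1 = reserved bit set in a paging-structure entry
    4: "INSTR",  # 1 = instruction fetch
    5: "PK",     # 1 = protection-key violation
}


def decode_pf_error(code: int) -> list[str]:
    """Return the names of the error-code bits that are set, lowest bit first."""
    return [name for bit, name in PF_BITS.items() if code & (1 << bit)]


def reserved_pa_bits(pte: int, maxphyaddr: int) -> list[int]:
    """Return PTE bit positions in [maxphyaddr, 51] that are set.

    In 4-level paging, physical-address bits at or above MAXPHYADDR are
    reserved, so any of them being set causes a RSVD page fault.
    """
    return [b for b in range(maxphyaddr, 52) if pte & (1 << b)]


if __name__ == "__main__":
    # Error code and PTE copied from the abrt-dump-journ oops above.
    print(decode_pf_error(0x000D))
    # MAXPHYADDR = 46 is an assumed value, not read from this guest.
    print(reserved_pa_bits(0x243C8B48FFFFFF57, 46))
```

With the values from this oops it prints ['PROT', 'USER', 'RSVD'] and [47, 50, 51]; the second result only holds under the assumed MAXPHYADDR.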
> > > [ 75.770849] Bluetooth: Core ver 2.22
> > > [ 75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > > [ 75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > > [ 75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > > [ 75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > > [ 75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > This code looks suspicious. Large runs of zero bytes in a normal function?
> > And the code itself is nonsense: it uses a memory access via ES:, which
> > doesn't make any sense in a 64-bit kernel.
>
> Could it still be something related to the module layout in memory?
> It doesn't seem to be a 100% reliable crash: in at least one instance
> sys-usb remained running (unfortunately I didn't collect the full
> sys-usb console log from the successful test...).
>
> I just double-checked that this crash didn't happen with any 6.12 or 6.11
> kernel.
>
> --
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab