From: Conor Dooley <conor@kernel.org>
To: pmladek@suse.com, senozhatsky@chromium.org, rostedt@goodmis.org,
john.ogness@linutronix.de, linux-kernel@vger.kernel.org
Subject: [resend][bug] low-probability console lockups since 5.19
Date: Fri, 23 Sep 2022 17:24:17 +0100 [thread overview]
Message-ID: <Yy3dsSgpDnovBdEn@spud> (raw)
Hey all,
Been bisecting a bug that is causing a boot failure in my CI & have
ended up here.. The bug in question is a low(ish) probability lock up
of the serial console, I would estimate about 1-in-5 chance on the
boards I could actually trigger it on which it has taken me so long
to realise that this was an actual problem. Thinking back on it, there
were other failures that I would retroactively attribute to this
problem too, but I had earlycon disabled and I chalked the hangs at
"Starting kernel" down to pre-release hardware since the other boots
in a CI job would not fail (we boot the same config multiple times),
Silly me I guess!
In a boot failure (with earlycon & keep_bootcon enabled) I see the
likes of [0]:
[ 0.485063] NFS: Registering the id_resolver key type
[ 0.491051] Key type id_resolver registered
[ 0.495743] Key type id_legacy registered
[ 0.500528] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
[ 0.508010] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
[ 0.516857] 9p: Installing v9fs 9p2000 file system support
[ 0.524032] NET: Registered PF_ALG protocol family
[ 0.529562] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
[ 0.537817] io scheduler mq-deadline registered
[ 0.542889] io scheduler kyber registered
[ 0.559682] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
[ 0.568431] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
[ 0.635633] microchip-pcie 2000000000.pcie: axi read request error
[ 0.642602] microchip-pcie 2000000000.pcie: axi read timeout
[ 0.776630] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.784133] microchip-pcie 2000000000.pcie: axi write request error
[ 0.791100] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.798543] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.805981] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.813421] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.820859] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.828295] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 1.579668] Freeing initrd memory: 7316K
The sec error stuff is unrelated & happens on good boots too - although
there are far fewer lines of "sec error" there [1] (excuse the message
duplication from the bootcon).
I was able to bisect the behaviour to this set of reverts, but I have
low confidence as to which commit specifically caused it - it's hard
to know that doing 20+ reboots on every commmit I built is enough to
make sure I did not make a mistake. I had initially started the
bisection between v6.0-rc6 & v6.0-rc1, and that lead me down a sub-tree
of net which is why the bisect log starts at a "random" commit as that
merge commit was the one blamed by the inital bisection.
git bisect start
# status: waiting for both good and bad commits
# bad: [5e8379351dbde61ea383e514f0f9ecb2c047cf4e] Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect bad 5e8379351dbde61ea383e514f0f9ecb2c047cf4e
# good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
# good: [fbe86daca0ba878b04fa241b85e26e54d17d4229] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good fbe86daca0ba878b04fa241b85e26e54d17d4229
# good: [2b8c612c6102f751e6e3e1bd425f64e9d3d3f638] kernel/reboot: Fix powering off using a non-syscall code paths
git bisect good 2b8c612c6102f751e6e3e1bd425f64e9d3d3f638
# good: [54c2cc79194c961a213c1d375fe3aa4165664cc4] Merge tag 'usb-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect good 54c2cc79194c961a213c1d375fe3aa4165664cc4
# skip: [8f7ac50c97d30ae5ae48da3b516510d05a67b9e5] Merge tag 'sound-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect skip 8f7ac50c97d30ae5ae48da3b516510d05a67b9e5
# good: [cdd85786f4b3b9273e4376e69aa95a2d71722764] x86/tdx: Clarify RIP adjustments in #VE handler
git bisect good cdd85786f4b3b9273e4376e69aa95a2d71722764
# good: [8cc683833726912498130a0130fc3bd0d169ef59] Merge branch 'net-dp83822-fix-interrupt-floods'
git bisect good 8cc683833726912498130a0130fc3bd0d169ef59
# bad: [0840a7914caa14315a3191178a9f72c742477860] Merge tag 'char-misc-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 0840a7914caa14315a3191178a9f72c742477860
# bad: [38bc4ac431684498126f9baa3a530e5a132f0173] Merge tag 'drm-fixes-2022-06-24' of git://anongit.freedesktop.org/drm/drm
git bisect bad 38bc4ac431684498126f9baa3a530e5a132f0173
# good: [399bd66e219e331976fe6fa6ab81a023c0c97870] Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 399bd66e219e331976fe6fa6ab81a023c0c97870
# bad: [a237cfd6b7469d6f5eeaa45f30128ab78b5281a5] Merge tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block
git bisect bad a237cfd6b7469d6f5eeaa45f30128ab78b5281a5
# good: [92f20ff72066d8d7e2ffb655c2236259ac9d1c5d] Merge tag 'pm-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 92f20ff72066d8d7e2ffb655c2236259ac9d1c5d
# bad: [598f2404879336277a4320ac5000394b873e049a] Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block
git bisect bad 598f2404879336277a4320ac5000394b873e049a
# good: [386e4fb6962b9f248a80f8870aea0870ca603e89] io_uring: use original request task for inflight tracking
git bisect good 386e4fb6962b9f248a80f8870aea0870ca603e89
# good: [2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a] Revert "printk: extend console_lock for per-console locking"
git bisect good 2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a
# bad: [07a22b61946f0b80065b0ddcc703b715f84355f5] Revert "printk: add functions to prefer direct printing"
git bisect bad 07a22b61946f0b80065b0ddcc703b715f84355f5
# bad: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
git bisect bad 5831788afb17b89c5b531fb60cbd798613ccbb63
# first bad commit: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
On 5831788afb17 ("Revert "printk: add kthread console printers"") or on
07a22b61946f ("Revert "printk: add functions to prefer direct printing"")
I was not able to reproduce the exact error conditions that I got on
the current HEAD or any of the commits along my bisection, but instead I
got something different [2]:
[ 0.471992] 9p: Installing v9fs 9p2000 file system support
[ 0.479049] NET: Registered PF_ALG protocol family
[ 0.484585] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
[ 0.492813] io scheduler mq-deadline registered
[ 0.497871] io scheduler kyber registered
[ 0.513472] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
[ 0.522194] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
[ 0.588518] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.596027] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 0.603467] microchip-pcie 2000000000.pcie: axi read request error
[ 0.610322] microchip-pcie 2000000000.pcie: axi read timeout
[ 0.616603] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.624025] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 0.631450] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.638877] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 0.646300] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.653725] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 1.516378] Freeing initrd memory: 7336K
[ 5.591881] mc_event_handler: 667402 callbacks suppressed
[ 5.591909] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 5.605290] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 5.612704] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
<truncated>
[ 21.539878] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 21.546651] rcu: 0-...0: (1 GPs behind) idle=19f/1/0x4000000000000002 softirq=34/36 fqs=2626
[ 21.556177] (detected by 1, t=5256 jiffies, g=-1151, q=1143 ncpus=4)
[ 21.563293] Task dump for CPU 0:
[ 21.566860] task:swapper/0 state:R running task stack: 0 pid: 1 ppid: 0 flags:0x00000008
[ 21.577823] Call Trace:
Unfortunately I do not really have any more information than that for
you, I am hoping that you (plural) could be of some help as to where
I should start looking - although after spending every evening this
week bisecting/setting up so that I could automate my bisection it'll
be a few days before I could really focus on this again.
My system is an embedded RISC-V dev board (PolarFire SoC Icicle kit).
There is also a failing build in the ClangBuiltLinux CI [3] that is
seeing similar symptoms, but I was unable to reproduce this problem
doing 1000s of reboots of a known-bad commit in QEMU myself.
LMK if you need any more info,
Conor.
[0] https://gist.githubusercontent.com/ConchuOD/575f4bfe3d68b9b32227740365b04401/raw/c69dfffd21183898edaeb223a9455ebc28edcf65/gistfile1.txt
[1] https://gist.github.com/ConchuOD/575f4bfe3d68b9b32227740365b04401?permalink_comment_id=4313185#gistcomment-4313185
[2] https://gist.github.com/ConchuOD/575f4bfe3d68b9b32227740365b04401?permalink_comment_id=4313202#gistcomment-4313202
[3] https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/3090070308/jobs/4999131449#step:5:270
next reply other threads:[~2022-09-23 16:24 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-23 16:24 Conor Dooley [this message]
2022-09-26 10:32 ` [resend][bug] low-probability console lockups since 5.19 Petr Mladek
2022-09-26 13:07 ` Conor Dooley
2022-09-28 16:55 ` Conor Dooley
2022-09-29 9:06 ` Thorsten Leemhuis
2022-09-29 9:29 ` Conor Dooley
2022-09-29 10:12 ` Petr Mladek
2022-09-29 10:52 ` Conor Dooley
2022-09-29 14:13 ` John Ogness
2022-09-29 21:22 ` Conor Dooley
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yy3dsSgpDnovBdEn@spud \
--to=conor@kernel.org \
--cc=john.ogness@linutronix.de \
--cc=linux-kernel@vger.kernel.org \
--cc=pmladek@suse.com \
--cc=rostedt@goodmis.org \
--cc=senozhatsky@chromium.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.