From: "Rafał Miłecki" <zajec5@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
Russell King <linux@armlinux.org.uk>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Thomas Gleixner <tglx@linutronix.de>,
Florian Fainelli <f.fainelli@gmail.com>,
linux-clk@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: openwrt-devel@lists.openwrt.org, bcm-kernel-feedback-list@broadcom.com
Subject: ARM BCM53573 SoC hangs/lockups caused by locks/clock/random changes
Date: Mon, 4 Sep 2023 10:33:59 +0200
Message-ID: <a03a6e1d-e99c-40a3-bdac-0075b5339beb@gmail.com>
I made a second attempt at debugging some longstanding stability issues
affecting BCM53573 SoCs. Those are single CPU core ARM Cortex-A7 boards
with a pretty slow arch timer running at 36.8 kHz.
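To put that rate in perspective, the tick period can be worked out directly. The following is only a userspace illustration, not kernel code: tick_period_ns() is a hypothetical helper name, and the 24 MHz figure used for comparison is an assumed example of a faster counter, not something taken from this board.

```c
#include <stdint.h>

/*
 * Nanoseconds per counter tick for a timer running at `hz`, rounded down.
 * Hypothetical helper for illustration only.
 */
uint64_t tick_period_ns(uint64_t hz)
{
	return UINT64_C(1000000000) / hz;
}
```

Calling tick_period_ns(36800) gives a period of roughly 27 us per tick, while the assumed 24 MHz counter would tick about every 41 ns. Any interval shorter than one tick cannot be resolved by the counter.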
After 0 to 20 minutes of close to zero activity I experience hangs and I
need to wait a minute for the watchdog to kick in and reboot the device.
First debugging attempt:
https://lore.kernel.org/netdev/0f9d0cd6-d344-7915-7bc1-7a090b8305d2@gmail.com/T/ ("ARM board lockups/hangs triggered by locks and mutexes")
After a lot of bisecting, testing & hacking I believe there are 3 kinds
of kernel changes that affect BCM53573 stability. I'd like to describe
them below to document my debugging work. I'm clueless at this point.
Maybe someone can come up with an idea of what the actual issue is &
ideally a solution.
#####
1. Locking
During my first bisecting attempts I found multiple locking-related
commits that regressed stability.
Bisected commits:
131287ff833d ("once: add DO_ONCE_SLOW() for sleepable contexts").
and a following group:
d0d583484d2e ("locking/refcount: Consolidate implementations of refcount_t")
dab787c73f6e ("locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions")
0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
809554147d60 ("locking/refcount: Improve performance of generic REFCOUNT_FULL code")
9c9269977f03 ("locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header")
04bff7d7b808 ("locking/refcount: Remove unused refcount_*_checked() variants")
513b19a43bec ("locking/refcount: Ensure integer operands are treated as signed")
68b4ee68e8c8 ("locking/refcount: Define constants for saturation and max refcount values")
I don't believe there is actually anything wrong with the above changes.
Maybe it's some tiny timing thing that my board just doesn't like?
#####
2. Clock (arm,armv7-timer)
While comparing the main clock in Broadcom's SDK with the upstream one I
noticed a tiny difference: the mask value. I don't know if it makes any
sense but switching from CLOCKSOURCE_MASK(56) to CLOCKSOURCE_MASK(64) in
arm_arch_timer.c (to match the SDK) increases average uptime (time before
a hang/lockup happens) from 4 minutes to 36 minutes.
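As a rough sanity check on what the two mask widths mean at this timer rate, the wrap time of the counter can be estimated. This is a userspace sketch under stated assumptions: clocksource_mask() is a hypothetical stand-in that mimics what CLOCKSOURCE_MASK(bits) evaluates to (the low `bits` bits set), and years_to_wrap() is a helper made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for CLOCKSOURCE_MASK(bits): the low `bits` bits set. */
uint64_t clocksource_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined in C, so special-case it. */
	return bits < 64 ? (UINT64_C(1) << bits) - 1 : UINT64_MAX;
}

/* Approximate years until a `bits`-wide counter wraps at `hz` ticks per second. */
double years_to_wrap(unsigned int bits, double hz)
{
	double ticks = (double)clocksource_mask(bits) + 1.0;

	return ticks / hz / (365.25 * 24.0 * 3600.0);
}
```

Under these assumptions years_to_wrap(56, 36800.0) comes out on the order of 60,000 years, so a plain counter wrap seems an unlikely explanation for hangs that show up within minutes. The sketch does not explain why the mask change affects uptime.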
#####
3. Random code changes
During my bisecting attempts I found one commit that regressed kernel
stability but whose actual changes were meaningless in the context of locking. It
was commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26
I thought that maybe it was all about making add_mtd_device() bigger and
changing addresses of a lot of symbols (looking at System.map). So I
reverted that mtd commit and developed a dummy change relocating as few
symbols (System.map) as possible while still breaking stability:
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -94,6 +94,21 @@ void __cpuidle default_idle_call(void)
 		arch_cpu_idle();
 		start_critical_timings();
 	}
+
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 5678)
+		arch_cpu_idle();
+	if (cpu_idle_force_poll == 1234)
+		arch_cpu_idle();
 }
 
 static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
The above dummy change didn't relocate thousands of symbols but only
about 20 of them. They happened to be lock symbols however. Does it make
any sense for the above diff to regress kernel stability for me and cause
hangs/lockups?
--- System.map.good
+++ System.map.bad
@@ -22214,36 +22214,36 @@
c062e7e0 T __cpuidle_text_start
c062e7e0 t cpu_idle_poll
c062e860 T default_idle_call
-c062e884 T __cpuidle_text_end
-c062e888 T __lock_text_start
-c062e8a0 T _raw_spin_unlock_irqrestore
-c062e8c0 T _raw_spin_trylock
-c062e900 T _raw_write_unlock_irqrestore
-c062e920 T _raw_read_trylock
-c062e960 T _raw_write_trylock
-c062e9a0 T _raw_spin_lock_bh
-c062ea00 T _raw_read_lock_bh
-c062ea40 T _raw_write_lock_bh
-c062ea80 T _raw_spin_trylock_bh
-c062eb00 T _raw_spin_unlock_bh
-c062eb40 T _raw_write_unlock_bh
-c062eb80 T _raw_read_unlock_bh
-c062ebc0 T _raw_read_unlock_irqrestore
-c062ec00 T _raw_write_lock
-c062ec40 T _raw_write_lock_irq
-c062ec80 T _raw_write_lock_irqsave
-c062ecc0 T _raw_read_lock
-c062ed00 T _raw_spin_lock
-c062ed40 T _raw_read_lock_irq
-c062ed80 T _raw_spin_lock_irq
-c062ede0 T _raw_spin_lock_irqsave
-c062ee40 T _raw_read_lock_irqsave
-c062ee70 T __hyp_text_end
-c062ee70 T __hyp_text_start
-c062ee70 T __kprobes_text_end
-c062ee70 T __kprobes_text_start
-c062ee70 T __lock_text_end
-c062ee70 T _etext
+c062e954 T __cpuidle_text_end
+c062e958 T __lock_text_start
+c062e960 T _raw_spin_unlock_irqrestore
+c062e980 T _raw_spin_trylock
+c062e9c0 T _raw_write_unlock_irqrestore
+c062e9e0 T _raw_read_trylock
+c062ea20 T _raw_write_trylock
+c062ea60 T _raw_spin_lock_bh
+c062eac0 T _raw_read_lock_bh
+c062eb00 T _raw_write_lock_bh
+c062eb40 T _raw_spin_trylock_bh
+c062ebc0 T _raw_spin_unlock_bh
+c062ec00 T _raw_write_unlock_bh
+c062ec40 T _raw_read_unlock_bh
+c062ec80 T _raw_read_unlock_irqrestore
+c062ecc0 T _raw_write_lock
+c062ed00 T _raw_write_lock_irq
+c062ed40 T _raw_write_lock_irqsave
+c062ed80 T _raw_read_lock
+c062edc0 T _raw_spin_lock
+c062ee00 T _raw_read_lock_irq
+c062ee40 T _raw_spin_lock_irq
+c062eea0 T _raw_spin_lock_irqsave
+c062ef00 T _raw_read_lock_irqsave
+c062ef30 T __hyp_text_end
+c062ef30 T __hyp_text_start
+c062ef30 T __kprobes_text_end
+c062ef30 T __kprobes_text_start
+c062ef30 T __lock_text_end
+c062ef30 T _etext
c062f000 D __start_rodata
c062f000 D sigreturn_codes
c062f044 d cpu_arch_name
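To quantify how far the symbols in the System.map diff above actually moved, the shifts can be computed from a few of its rows. This is only an illustrative sketch: the struct and helper names are made up for this example, and the 64-byte granule in the usage note is an assumption picked for illustration, not a statement about this SoC's caches.

```c
#include <assert.h>
#include <stdint.h>

struct sym_move {
	const char *name;
	uint32_t good;	/* address in System.map.good */
	uint32_t bad;	/* address in System.map.bad */
};

/* A few rows copied from the System.map diff above. */
static const struct sym_move moves[] = {
	{ "__cpuidle_text_end",          0xc062e884, 0xc062e954 },
	{ "_raw_spin_unlock_irqrestore", 0xc062e8a0, 0xc062e960 },
	{ "_raw_spin_lock",              0xc062ed00, 0xc062edc0 },
	{ "_etext",                      0xc062ee70, 0xc062ef30 },
};

/* Number of bytes a symbol moved between the good and the bad build. */
uint32_t sym_shift(const struct sym_move *m)
{
	return m->bad - m->good;
}

/* Offset of `addr` inside a power-of-two sized granule (e.g. 64 bytes). */
uint32_t offset_in_granule(uint32_t addr, uint32_t granule)
{
	assert(granule != 0 && (granule & (granule - 1)) == 0);
	return addr & (granule - 1);
}
```

For the rows listed, __cpuidle_text_end moves by 0xd0 bytes while the lock functions and _etext move by 0xc0 (192) bytes. Since 192 is a multiple of 64, offset_in_granule(addr, 64) is the same for those lock symbols in both maps. Whether that has anything to do with the hangs is not something this sketch can show.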
###
As those hangs/lockups are related to so many different changes it's
really hard to debug them.
This bug seems to be specific to the slow arch clock and affects
stability only when kernel locking code and symbol layout trigger some
very specific timing.
Enabling CONFIG_PROVE_LOCKING seems to make the issue go away but it
affects so much code it's hard to tell why it actually matters.
Same for disabling CONFIG_SMP. I noticed Broadcom's SDK keeps it
disabled. I tried it and it does indeed improve stability (I have 3
devices with 6 days of uptime and counting). Again it affects a lot of
kernel parts so it's hard to tell why it helps.
Unless someone comes up with some magic solution I'll probably try
building BCM53573 images without CONFIG_SMP for my personal needs.