* [PATCH v2 0/2] sysfs: add counters for lockups and stalls
@ 2025-05-04 18:08 Max Kellermann
2025-05-04 18:08 ` [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count Max Kellermann
2025-05-04 18:08 ` [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count Max Kellermann
0 siblings, 2 replies; 6+ messages in thread
From: Max Kellermann @ 2025-05-04 18:08 UTC (permalink / raw)
To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
Cc: Max Kellermann
Commits 9db89b411170 ("exit: Expose "oops_count" to sysfs") and
8b05aa263361 ("panic: Expose "warn_count" to sysfs") added counters
for oopses and warnings to sysfs, and these two patches do the same
for hard/soft lockups and RCU stalls.
All of these counters are useful for monitoring tools to detect
whether the machine is healthy. If the kernel has experienced a
lockup or a stall, it's probably due to a kernel bug, and I'd like to
detect that quickly and easily. There is currently no way to detect
that other than parsing dmesg or observing indirect effects, such as
certain tasks not responding. But then I need to observe all tasks,
and it may take a while until these effects become visible/measurable.
I'd rather be able to detect the primary cause more quickly, possibly
before everything falls apart.
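As a rough illustration of the monitoring side, here is a minimal userspace sketch in C. It assumes each counter file holds a single decimal number followed by a newline, which is what the sysfs_emit(page, "%u\n", ...) calls in these patches produce. The helper names parse_counter() and read_counter() are hypothetical and not part of the patches.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text of a counter file ("<decimal>\n").
 * Returns 0 on success, -1 if the text is not a plain unsigned number
 * or does not fit into an unsigned int.
 */
int parse_counter(const char *text, unsigned int *out)
{
	char *end;
	unsigned long value;

	/* strtoul() would accept a sign or leading spaces; reject them here. */
	if (text[0] < '0' || text[0] > '9')
		return -1;

	errno = 0;
	value = strtoul(text, &end, 10);
	if (errno != 0 || value > UINT_MAX)
		return -1;
	if (*end != '\n' && *end != '\0')
		return -1;

	*out = (unsigned int)value;
	return 0;
}

/*
 * Read one counter file, for example "/sys/kernel/softlockup_count".
 * Returns 0 on success, -1 if the file cannot be read or parsed.
 */
int read_counter(const char *path, unsigned int *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_counter(buf, out);
}
```

A monitoring tool could call read_counter() on each of the three files and raise an alert whenever a value is non-zero or has grown since the previous sample.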
Max Kellermann (2):
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
.../ABI/testing/sysfs-kernel-hardlockup_count | 7 +++
.../ABI/testing/sysfs-kernel-rcu_stall_count | 6 +++
.../ABI/testing/sysfs-kernel-softlockup_count | 7 +++
kernel/rcu/tree_stall.h | 26 +++++++++
kernel/watchdog.c | 53 +++++++++++++++++++
5 files changed, 99 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-hardlockup_count
create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
create mode 100644 Documentation/ABI/testing/sysfs-kernel-softlockup_count
--
2.47.2
* [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
2025-05-04 18:08 [PATCH v2 0/2] sysfs: add counters for lockups and stalls Max Kellermann
@ 2025-05-04 18:08 ` Max Kellermann
2025-05-04 18:08 ` [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count Max Kellermann
1 sibling, 0 replies; 6+ messages in thread
From: Max Kellermann @ 2025-05-04 18:08 UTC (permalink / raw)
To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
Cc: Max Kellermann
There are /proc/sys/kernel/hung_task_detect_count,
/sys/kernel/warn_count and /sys/kernel/oops_count, but there is no
userspace-accessible counter for hard/soft lockups. Having one is
useful for monitoring tools.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
v1 -> v2: added documentation; added patch set cover letter with
justification
---
.../ABI/testing/sysfs-kernel-hardlockup_count | 7 +++
.../ABI/testing/sysfs-kernel-softlockup_count | 7 +++
kernel/watchdog.c | 53 +++++++++++++++++++
3 files changed, 67 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-hardlockup_count
create mode 100644 Documentation/ABI/testing/sysfs-kernel-softlockup_count
diff --git a/Documentation/ABI/testing/sysfs-kernel-hardlockup_count b/Documentation/ABI/testing/sysfs-kernel-hardlockup_count
new file mode 100644
index 000000000000..dfdd4078b077
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-hardlockup_count
@@ -0,0 +1,7 @@
+What: /sys/kernel/hardlockup_count
+Date: May 2025
+KernelVersion: 6.16
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ Shows how many times the system has detected a hard lockup since last boot.
+ Available only if CONFIG_HARDLOCKUP_DETECTOR is enabled.
diff --git a/Documentation/ABI/testing/sysfs-kernel-softlockup_count b/Documentation/ABI/testing/sysfs-kernel-softlockup_count
new file mode 100644
index 000000000000..337ff5531b5f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-softlockup_count
@@ -0,0 +1,7 @@
+What: /sys/kernel/softlockup_count
+Date: May 2025
+KernelVersion: 6.16
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ Shows how many times the system has detected a soft lockup since last boot.
+ Available only if CONFIG_SOFTLOCKUP_DETECTOR is enabled.
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 9fa2af9dbf2c..09994bfb47af 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -63,6 +63,29 @@ int __read_mostly sysctl_hardlockup_all_cpu_backtrace;
*/
unsigned int __read_mostly hardlockup_panic =
IS_ENABLED(CONFIG_BOOTPARAM_HARDLOCKUP_PANIC);
+
+#ifdef CONFIG_SYSFS
+
+static unsigned int hardlockup_count;
+
+static ssize_t hardlockup_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *page)
+{
+ return sysfs_emit(page, "%u\n", hardlockup_count);
+}
+
+static struct kobj_attribute hardlockup_count_attr = __ATTR_RO(hardlockup_count);
+
+static __init int kernel_hardlockup_sysfs_init(void)
+{
+ sysfs_add_file_to_group(kernel_kobj, &hardlockup_count_attr.attr, NULL);
+ return 0;
+}
+
+late_initcall(kernel_hardlockup_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
/*
* We may not want to enable hard lockup detection by default in all cases,
* for example when running the kernel as a guest on a hypervisor. In these
@@ -169,6 +192,10 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
unsigned int this_cpu = smp_processor_id();
unsigned long flags;
+#ifdef CONFIG_SYSFS
+ ++hardlockup_count;
+#endif
+
/* Only print hardlockups once. */
if (per_cpu(watchdog_hardlockup_warned, cpu))
return;
@@ -311,6 +338,28 @@ unsigned int __read_mostly softlockup_panic =
static bool softlockup_initialized __read_mostly;
static u64 __read_mostly sample_period;
+#ifdef CONFIG_SYSFS
+
+static unsigned int softlockup_count;
+
+static ssize_t softlockup_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *page)
+{
+ return sysfs_emit(page, "%u\n", softlockup_count);
+}
+
+static struct kobj_attribute softlockup_count_attr = __ATTR_RO(softlockup_count);
+
+static __init int kernel_softlockup_sysfs_init(void)
+{
+ sysfs_add_file_to_group(kernel_kobj, &softlockup_count_attr.attr, NULL);
+ return 0;
+}
+
+late_initcall(kernel_softlockup_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
/* Timestamp taken after the last successful reschedule. */
static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
/* Timestamp of the last softlockup report. */
@@ -742,6 +791,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
touch_ts = __this_cpu_read(watchdog_touch_ts);
duration = is_softlockup(touch_ts, period_ts, now);
if (unlikely(duration)) {
+#ifdef CONFIG_SYSFS
+ ++softlockup_count;
+#endif
+
/*
* Prevent multiple soft-lockup reports if one cpu is already
* engaged in dumping all cpu back traces.
--
2.47.2
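The ABI documentation in this patch states that each file is available only if the corresponding detector is configured, so a userspace consumer probably wants to tell "counter absent" apart from "counter unreadable". A minimal sketch, assuming a POSIX userspace; the enum and the function name probe_counter() are hypothetical and not part of the patch.

```c
#include <errno.h>
#include <unistd.h>

enum counter_status {
	COUNTER_PRESENT,	/* file exists and is readable */
	COUNTER_ABSENT,		/* file does not exist, e.g. detector not built in */
	COUNTER_ERROR,		/* any other failure, e.g. permission denied */
};

/* Probe a counter file such as "/sys/kernel/hardlockup_count". */
enum counter_status probe_counter(const char *path)
{
	if (access(path, R_OK) == 0)
		return COUNTER_PRESENT;
	if (errno == ENOENT)
		return COUNTER_ABSENT;
	return COUNTER_ERROR;
}
```

Treating COUNTER_ABSENT as "not monitored" rather than as a failure lets the same tool run on kernels built without CONFIG_HARDLOCKUP_DETECTOR or CONFIG_SOFTLOCKUP_DETECTOR.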
* [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
2025-05-04 18:08 [PATCH v2 0/2] sysfs: add counters for lockups and stalls Max Kellermann
2025-05-04 18:08 ` [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count Max Kellermann
@ 2025-05-04 18:08 ` Max Kellermann
2025-06-03 16:39 ` Sourabh Jain
1 sibling, 1 reply; 6+ messages in thread
From: Max Kellermann @ 2025-05-04 18:08 UTC (permalink / raw)
To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
Cc: Max Kellermann
Expose a simple counter to userspace for monitoring tools.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
v1 -> v2: added documentation
---
.../ABI/testing/sysfs-kernel-rcu_stall_count | 6 +++++
kernel/rcu/tree_stall.h | 26 +++++++++++++++++++
2 files changed, 32 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
diff --git a/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
new file mode 100644
index 000000000000..a4a97a7f4a4d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
@@ -0,0 +1,6 @@
+What: /sys/kernel/rcu_stall_count
+Date: May 2025
+KernelVersion: 6.16
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ Shows how many times the system has detected an RCU stall since last boot.
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 925fcdad5dea..158330524795 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -20,6 +20,28 @@
int sysctl_panic_on_rcu_stall __read_mostly;
int sysctl_max_rcu_stall_to_panic __read_mostly;
+#ifdef CONFIG_SYSFS
+
+static unsigned int rcu_stall_count;
+
+static ssize_t rcu_stall_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *page)
+{
+ return sysfs_emit(page, "%u\n", rcu_stall_count);
+}
+
+static struct kobj_attribute rcu_stall_count_attr = __ATTR_RO(rcu_stall_count);
+
+static __init int kernel_rcu_stall_sysfs_init(void)
+{
+ sysfs_add_file_to_group(kernel_kobj, &rcu_stall_count_attr.attr, NULL);
+ return 0;
+}
+
+late_initcall(kernel_rcu_stall_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
#ifdef CONFIG_PROVE_RCU
#define RCU_STALL_DELAY_DELTA (5 * HZ)
#else
@@ -784,6 +806,10 @@ static void check_cpu_stall(struct rcu_data *rdp)
if (kvm_check_and_clear_guest_paused())
return;
+#ifdef CONFIG_SYSFS
+ ++rcu_stall_count;
+#endif
+
rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
--
2.47.2
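Since the counter is a plain unsigned int that only ever increases until the next boot, a monitoring tool would typically look at the difference between two samples rather than at the absolute value. A minimal userspace sketch of that idea follows; struct counter_state and counter_delta() are hypothetical names, not part of the patch.

```c
#include <stdbool.h>

/* Remembers the previous sample of one counter. */
struct counter_state {
	unsigned int last;	/* value seen at the previous sample */
	bool valid;		/* false until the first sample was taken */
};

/*
 * Feed a new sample and return how many events happened since the
 * previous one. The first sample only establishes a baseline and
 * returns 0. Unsigned subtraction wraps modulo UINT_MAX + 1, so a
 * counter overflow between two samples still yields the right delta.
 */
unsigned int counter_delta(struct counter_state *st, unsigned int now)
{
	unsigned int delta = 0;

	if (st->valid)
		delta = now - st->last;

	st->last = now;
	st->valid = true;
	return delta;
}
```

Note that the counter restarts from zero after a reboot, so a caller of this sketch would have to reset its counter_state when it detects a new boot; otherwise the first delta after the reboot would be bogus.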
* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
2025-05-04 18:08 ` [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count Max Kellermann
@ 2025-06-03 16:39 ` Sourabh Jain
2025-06-04 0:16 ` Andrew Morton
0 siblings, 1 reply; 6+ messages in thread
From: Sourabh Jain @ 2025-06-03 16:39 UTC (permalink / raw)
To: akpm, Max Kellermann, song, joel.granados, dianders, cminyard,
linux-kernel
Hello Andrew,
On 04/05/25 23:38, Max Kellermann wrote:
> Exposing a simple counter to userspace for monitoring tools.
>
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
> v1 -> v2: added documentation
> ---
> .../ABI/testing/sysfs-kernel-rcu_stall_count | 6 +++++
> kernel/rcu/tree_stall.h | 26 +++++++++++++++++++
> 2 files changed, 32 insertions(+)
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
> new file mode 100644
> index 000000000000..a4a97a7f4a4d
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
> @@ -0,0 +1,6 @@
> +What: /sys/kernel/rcu_stall_count
> +Date: May 2025
> +KernelVersion: 6.16
> +Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:
> + Shows how many times the system has detected an RCU stall since last boot.
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 925fcdad5dea..158330524795 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -20,6 +20,28 @@
> int sysctl_panic_on_rcu_stall __read_mostly;
> int sysctl_max_rcu_stall_to_panic __read_mostly;
>
> +#ifdef CONFIG_SYSFS
> +
> +static unsigned int rcu_stall_count;
> +
> +static ssize_t rcu_stall_count_show(struct kobject *kobj, struct kobj_attribute *attr,
> + char *page)
> +{
> + return sysfs_emit(page, "%u\n", rcu_stall_count);
> +}
> +
> +static struct kobj_attribute rcu_stall_count_attr = __ATTR_RO(rcu_stall_count);
> +
> +static __init int kernel_rcu_stall_sysfs_init(void)
> +{
> + sysfs_add_file_to_group(kernel_kobj, &rcu_stall_count_attr.attr, NULL);
> + return 0;
> +}
> +
> +late_initcall(kernel_rcu_stall_sysfs_init);
> +
> +#endif // CONFIG_SYSFS
> +
> #ifdef CONFIG_PROVE_RCU
> #define RCU_STALL_DELAY_DELTA (5 * HZ)
> #else
> @@ -784,6 +806,10 @@ static void check_cpu_stall(struct rcu_data *rdp)
> if (kvm_check_and_clear_guest_paused())
> return;
>
> +#ifdef CONFIG_SYSFS
> + ++rcu_stall_count;
> +#endif
> +
> rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
> if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
> pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
It seems like this patch was not applied properly to the upstream tree.
Out of the three hunks in this patch, only the first one is applied; the
second and third hunks are missing.
commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
Author: Max Kellermann <max.kellermann@ionos.com>
Date: Sun May 4 20:08:31 2025 +0200
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
Expose a simple counter to userspace for monitoring tools.
Thanks,
Sourabh Jain
* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
2025-06-03 16:39 ` Sourabh Jain
@ 2025-06-04 0:16 ` Andrew Morton
2025-06-04 13:55 ` Sourabh Jain
0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2025-06-04 0:16 UTC (permalink / raw)
To: Sourabh Jain
Cc: Max Kellermann, song, joel.granados, dianders, cminyard,
linux-kernel
On Tue, 3 Jun 2025 22:09:30 +0530 Sourabh Jain <sourabhjain@linux.ibm.com> wrote:
> Hello Andrew,
>
> > +#endif
> > +
> > rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
> > if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
> > pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
>
> It seems like this patch was not applied properly to the upstream tree.
>
> Out of the three hunks in this patch, only the first one is applied; the
> second and third hunks are missing.
>
> commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
> Author: Max Kellermann <max.kellermann@ionos.com>
> Date: Sun May 4 20:08:31 2025 +0200
>
> kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
>
> Expose a simple counter to userspace for monitoring tools.
OK. iirc there was quite a lot of churn and conflicts here :)
Please send a fixup against latest -linus?
* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
2025-06-04 0:16 ` Andrew Morton
@ 2025-06-04 13:55 ` Sourabh Jain
0 siblings, 0 replies; 6+ messages in thread
From: Sourabh Jain @ 2025-06-04 13:55 UTC (permalink / raw)
To: Andrew Morton, Max Kellermann
Cc: song, joel.granados, dianders, cminyard, linux-kernel
On 04/06/25 05:46, Andrew Morton wrote:
> On Tue, 3 Jun 2025 22:09:30 +0530 Sourabh Jain <sourabhjain@linux.ibm.com> wrote:
>
>> Hello Andrew,
>>
>>> +#endif
>>> +
>>> rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
>>> if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
>>> pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
>> It seems like this patch was not applied properly to the upstream tree.
>>
>> Out of the three hunks in this patch, only the first one is applied; the
>> second and third hunks are missing.
>>
>> commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
>> Author: Max Kellermann <max.kellermann@ionos.com>
>> Date: Sun May 4 20:08:31 2025 +0200
>>
>> kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
>>
>> Expose a simple counter to userspace for monitoring tools.
> OK. iirc there was quite a lot of churn and conflicts here :)
>
> Please send a fixup against latest -linus?
Sure, I will wait for a day or two to see if Max is interested in
sending the fix-up patch. Otherwise, I will send it.
Thanks,
Sourabh Jain