linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] sysfs: add counters for lockups and stalls
@ 2025-05-04 18:08 Max Kellermann
  2025-05-04 18:08 ` [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count Max Kellermann
  2025-05-04 18:08 ` [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count Max Kellermann
  0 siblings, 2 replies; 6+ messages in thread
From: Max Kellermann @ 2025-05-04 18:08 UTC (permalink / raw)
  To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
  Cc: Max Kellermann

Commits 9db89b411170 ("exit: Expose "oops_count" to sysfs") and
8b05aa263361 ("panic: Expose "warn_count" to sysfs") added counters
for oopses and warnings to sysfs, and these two patches do the same
for hard/soft lockups and RCU stalls.

All of these counters are useful for monitoring tools to detect
whether the machine is healthy.  If the kernel has experienced a
lockup or a stall, it's probably due to a kernel bug, and I'd like to
detect that quickly and easily.  Currently, the only ways to detect
that are parsing dmesg or observing indirect effects, such as certain
tasks not responding.  The latter requires observing all tasks, and
it may take a while until these effects become visible/measurable.
I'd rather be able to detect the primary cause more quickly, possibly
before everything falls apart.
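
As an illustration of the monitoring use case described above, here is a
minimal userspace sketch in C, not part of the series.  It assumes the
three sysfs paths added by these patches and the "%u\n" format that the
show functions emit; the helper names parse_counter(), read_counter()
and report_counters() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the decimal text that a sysfs counter file returns ("%u\n").
 * Returns 0 and stores the value in *out on success, -1 on malformed input.
 */
int parse_counter(const char *text, unsigned int *out)
{
	char *end;
	unsigned long value;

	assert(text != NULL && out != NULL);

	/* Reject empty input, signs and leading whitespace up front. */
	if (*text < '0' || *text > '9')
		return -1;

	errno = 0;
	value = strtoul(text, &end, 10);
	if (errno != 0 || value > UINT_MAX)
		return -1;
	if (*end != '\n' && *end != '\0')
		return -1;

	*out = (unsigned int)value;
	return 0;
}

/*
 * Read one counter file.  Returns -1 if the file cannot be opened, which
 * is expected when the kernel lacks the patch or the matching detector.
 */
int read_counter(const char *path, unsigned int *out)
{
	char buf[32];
	int ret = -1;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		ret = parse_counter(buf, out);
	fclose(f);
	return ret;
}

/* Print every readable counter; return how many of them are non-zero. */
int report_counters(void)
{
	static const char *const paths[] = {
		"/sys/kernel/hardlockup_count",
		"/sys/kernel/softlockup_count",
		"/sys/kernel/rcu_stall_count",
	};
	int nonzero = 0;

	for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
		unsigned int value;

		if (read_counter(paths[i], &value) != 0) {
			printf("%s: unavailable\n", paths[i]);
			continue;
		}
		printf("%s: %u\n", paths[i], value);
		if (value != 0)
			nonzero++;
	}
	return nonzero;
}
```

A monitoring agent could call report_counters() periodically and raise
an alert when it returns a non-zero value.  A missing file is reported
as "unavailable" rather than as an error, because the lockup files only
exist when the corresponding detector is configured.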

Max Kellermann (2):
  kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count

 .../ABI/testing/sysfs-kernel-hardlockup_count |  7 +++
 .../ABI/testing/sysfs-kernel-rcu_stall_count  |  6 +++
 .../ABI/testing/sysfs-kernel-softlockup_count |  7 +++
 kernel/rcu/tree_stall.h                       | 26 +++++++++
 kernel/watchdog.c                             | 53 +++++++++++++++++++
 5 files changed, 99 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-hardlockup_count
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-softlockup_count

-- 
2.47.2



* [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  2025-05-04 18:08 [PATCH v2 0/2] sysfs: add counters for lockups and stalls Max Kellermann
@ 2025-05-04 18:08 ` Max Kellermann
  2025-05-04 18:08 ` [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count Max Kellermann
  1 sibling, 0 replies; 6+ messages in thread
From: Max Kellermann @ 2025-05-04 18:08 UTC (permalink / raw)
  To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
  Cc: Max Kellermann

There are /proc/sys/kernel/hung_task_detect_count,
/sys/kernel/warn_count and /sys/kernel/oops_count, but there is no
userspace-accessible counter for hard/soft lockups.  Such a counter
is useful for monitoring tools.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
v1 -> v2: added documentation; added patch set cover letter with
  justification
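
Since the counters are cumulative since boot, a monitoring tool would
typically compare successive samples rather than look at absolute
values.  The following C sketch shows one way to do that; it is an
assumption about how a consumer might work, not part of this patch, and
struct counter_watch and counter_watch_update() are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-counter state kept by a monitoring tool. */
struct counter_watch {
	unsigned int last;	/* value seen at the previous sample */
	int primed;		/* non-zero once a baseline exists */
};

/*
 * Feed the latest counter value and return the number of events since the
 * previous sample.  The first call only records a baseline and returns 0.
 * Unsigned subtraction is modulo UINT_MAX + 1, so a single wraparound
 * between two samples still yields the right delta.
 */
unsigned int counter_watch_update(struct counter_watch *w, unsigned int now)
{
	unsigned int delta;

	assert(w != NULL);
	if (!w->primed) {
		w->last = now;
		w->primed = 1;
		return 0;
	}
	delta = now - w->last;
	w->last = now;
	return delta;
}
```

The baseline-first design ignores events that happened before the tool
started.  Because the kernel counters restart at zero on every boot, a
tool that keeps this state across reboots would need to reset it, since
a smaller value after a reboot is otherwise indistinguishable from a
wraparound.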
---
 .../ABI/testing/sysfs-kernel-hardlockup_count |  7 +++
 .../ABI/testing/sysfs-kernel-softlockup_count |  7 +++
 kernel/watchdog.c                             | 53 +++++++++++++++++++
 3 files changed, 67 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-hardlockup_count
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-softlockup_count

diff --git a/Documentation/ABI/testing/sysfs-kernel-hardlockup_count b/Documentation/ABI/testing/sysfs-kernel-hardlockup_count
new file mode 100644
index 000000000000..dfdd4078b077
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-hardlockup_count
@@ -0,0 +1,7 @@
+What:		/sys/kernel/hardlockup_count
+Date:		May 2025
+KernelVersion:	6.16
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		Shows how many times the system has detected a hard lockup since last boot.
+		Available only if CONFIG_HARDLOCKUP_DETECTOR is enabled.
diff --git a/Documentation/ABI/testing/sysfs-kernel-softlockup_count b/Documentation/ABI/testing/sysfs-kernel-softlockup_count
new file mode 100644
index 000000000000..337ff5531b5f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-softlockup_count
@@ -0,0 +1,7 @@
+What:		/sys/kernel/softlockup_count
+Date:		May 2025
+KernelVersion:	6.16
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		Shows how many times the system has detected a soft lockup since last boot.
+		Available only if CONFIG_SOFTLOCKUP_DETECTOR is enabled.
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 9fa2af9dbf2c..09994bfb47af 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -63,6 +63,29 @@ int __read_mostly sysctl_hardlockup_all_cpu_backtrace;
  */
 unsigned int __read_mostly hardlockup_panic =
 			IS_ENABLED(CONFIG_BOOTPARAM_HARDLOCKUP_PANIC);
+
+#ifdef CONFIG_SYSFS
+
+static unsigned int hardlockup_count;
+
+static ssize_t hardlockup_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+				     char *page)
+{
+	return sysfs_emit(page, "%u\n", hardlockup_count);
+}
+
+static struct kobj_attribute hardlockup_count_attr = __ATTR_RO(hardlockup_count);
+
+static __init int kernel_hardlockup_sysfs_init(void)
+{
+	sysfs_add_file_to_group(kernel_kobj, &hardlockup_count_attr.attr, NULL);
+	return 0;
+}
+
+late_initcall(kernel_hardlockup_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
 /*
  * We may not want to enable hard lockup detection by default in all cases,
  * for example when running the kernel as a guest on a hypervisor. In these
@@ -169,6 +192,10 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 		unsigned int this_cpu = smp_processor_id();
 		unsigned long flags;
 
+#ifdef CONFIG_SYSFS
+		++hardlockup_count;
+#endif
+
 		/* Only print hardlockups once. */
 		if (per_cpu(watchdog_hardlockup_warned, cpu))
 			return;
@@ -311,6 +338,28 @@ unsigned int __read_mostly softlockup_panic =
 static bool softlockup_initialized __read_mostly;
 static u64 __read_mostly sample_period;
 
+#ifdef CONFIG_SYSFS
+
+static unsigned int softlockup_count;
+
+static ssize_t softlockup_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+				     char *page)
+{
+	return sysfs_emit(page, "%u\n", softlockup_count);
+}
+
+static struct kobj_attribute softlockup_count_attr = __ATTR_RO(softlockup_count);
+
+static __init int kernel_softlockup_sysfs_init(void)
+{
+	sysfs_add_file_to_group(kernel_kobj, &softlockup_count_attr.attr, NULL);
+	return 0;
+}
+
+late_initcall(kernel_softlockup_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
 /* Timestamp taken after the last successful reschedule. */
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 /* Timestamp of the last softlockup report. */
@@ -742,6 +791,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	touch_ts = __this_cpu_read(watchdog_touch_ts);
 	duration = is_softlockup(touch_ts, period_ts, now);
 	if (unlikely(duration)) {
+#ifdef CONFIG_SYSFS
+		++softlockup_count;
+#endif
+
 		/*
 		 * Prevent multiple soft-lockup reports if one cpu is already
 		 * engaged in dumping all cpu back traces.
-- 
2.47.2



* [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  2025-05-04 18:08 [PATCH v2 0/2] sysfs: add counters for lockups and stalls Max Kellermann
  2025-05-04 18:08 ` [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count Max Kellermann
@ 2025-05-04 18:08 ` Max Kellermann
  2025-06-03 16:39   ` Sourabh Jain
  1 sibling, 1 reply; 6+ messages in thread
From: Max Kellermann @ 2025-05-04 18:08 UTC (permalink / raw)
  To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
  Cc: Max Kellermann

Expose a simple counter to userspace for monitoring tools.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
v1 -> v2: added documentation
---
 .../ABI/testing/sysfs-kernel-rcu_stall_count  |  6 +++++
 kernel/rcu/tree_stall.h                       | 26 +++++++++++++++++++
 2 files changed, 32 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count

diff --git a/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
new file mode 100644
index 000000000000..a4a97a7f4a4d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
@@ -0,0 +1,6 @@
+What:		/sys/kernel/rcu_stall_count
+Date:		May 2025
+KernelVersion:	6.16
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		Shows how many times the system has detected an RCU stall since last boot.
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 925fcdad5dea..158330524795 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -20,6 +20,28 @@
 int sysctl_panic_on_rcu_stall __read_mostly;
 int sysctl_max_rcu_stall_to_panic __read_mostly;
 
+#ifdef CONFIG_SYSFS
+
+static unsigned int rcu_stall_count;
+
+static ssize_t rcu_stall_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+				    char *page)
+{
+	return sysfs_emit(page, "%u\n", rcu_stall_count);
+}
+
+static struct kobj_attribute rcu_stall_count_attr = __ATTR_RO(rcu_stall_count);
+
+static __init int kernel_rcu_stall_sysfs_init(void)
+{
+	sysfs_add_file_to_group(kernel_kobj, &rcu_stall_count_attr.attr, NULL);
+	return 0;
+}
+
+late_initcall(kernel_rcu_stall_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
 #ifdef CONFIG_PROVE_RCU
 #define RCU_STALL_DELAY_DELTA		(5 * HZ)
 #else
@@ -784,6 +806,10 @@ static void check_cpu_stall(struct rcu_data *rdp)
 		if (kvm_check_and_clear_guest_paused())
 			return;
 
+#ifdef CONFIG_SYSFS
+		++rcu_stall_count;
+#endif
+
 		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
 		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
 			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
-- 
2.47.2



* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  2025-05-04 18:08 ` [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count Max Kellermann
@ 2025-06-03 16:39   ` Sourabh Jain
  2025-06-04  0:16     ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Sourabh Jain @ 2025-06-03 16:39 UTC (permalink / raw)
  To: akpm, Max Kellermann, song, joel.granados, dianders, cminyard,
	linux-kernel

Hello Andrew,

On 04/05/25 23:38, Max Kellermann wrote:
> Expose a simple counter to userspace for monitoring tools.
>
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
> v1 -> v2: added documentation
> ---
>   .../ABI/testing/sysfs-kernel-rcu_stall_count  |  6 +++++
>   kernel/rcu/tree_stall.h                       | 26 +++++++++++++++++++
>   2 files changed, 32 insertions(+)
>   create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
> new file mode 100644
> index 000000000000..a4a97a7f4a4d
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
> @@ -0,0 +1,6 @@
> +What:		/sys/kernel/rcu_stall_count
> +Date:		May 2025
> +KernelVersion:	6.16
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:
> +		Shows how many times the system has detected an RCU stall since last boot.
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 925fcdad5dea..158330524795 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -20,6 +20,28 @@
>   int sysctl_panic_on_rcu_stall __read_mostly;
>   int sysctl_max_rcu_stall_to_panic __read_mostly;
>   
> +#ifdef CONFIG_SYSFS
> +
> +static unsigned int rcu_stall_count;
> +
> +static ssize_t rcu_stall_count_show(struct kobject *kobj, struct kobj_attribute *attr,
> +				    char *page)
> +{
> +	return sysfs_emit(page, "%u\n", rcu_stall_count);
> +}
> +
> +static struct kobj_attribute rcu_stall_count_attr = __ATTR_RO(rcu_stall_count);
> +
> +static __init int kernel_rcu_stall_sysfs_init(void)
> +{
> +	sysfs_add_file_to_group(kernel_kobj, &rcu_stall_count_attr.attr, NULL);
> +	return 0;
> +}
> +
> +late_initcall(kernel_rcu_stall_sysfs_init);
> +
> +#endif // CONFIG_SYSFS
> +
>   #ifdef CONFIG_PROVE_RCU
>   #define RCU_STALL_DELAY_DELTA		(5 * HZ)
>   #else
> @@ -784,6 +806,10 @@ static void check_cpu_stall(struct rcu_data *rdp)
>   		if (kvm_check_and_clear_guest_paused())
>   			return;
>   
> +#ifdef CONFIG_SYSFS
> +		++rcu_stall_count;
> +#endif
> +
>   		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
>   		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
>   			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);

It seems like this patch was not applied properly to the upstream tree.

Out of the three hunks in this patch, only the first one is applied; the
second and third hunks are missing.

commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
Author: Max Kellermann <max.kellermann@ionos.com>
Date:   Sun May 4 20:08:31 2025 +0200

     kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count

     Expose a simple counter to userspace for monitoring tools.


Thanks,
Sourabh Jain


* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  2025-06-03 16:39   ` Sourabh Jain
@ 2025-06-04  0:16     ` Andrew Morton
  2025-06-04 13:55       ` Sourabh Jain
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2025-06-04  0:16 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Max Kellermann, song, joel.granados, dianders, cminyard,
	linux-kernel

On Tue, 3 Jun 2025 22:09:30 +0530 Sourabh Jain <sourabhjain@linux.ibm.com> wrote:

> Hello Andrew,
> 
> > +#endif
> > +
> >   		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
> >   		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
> >   			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
> 
> It seems like this patch was not applied properly to the upstream tree.
> 
> Out of the three hunks in this patch, only the first one is applied; the
> second and third hunks are missing.
> 
> commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
> Author: Max Kellermann <max.kellermann@ionos.com>
> Date:   Sun May 4 20:08:31 2025 +0200
> 
>      kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
> 
>      Expose a simple counter to userspace for monitoring tools.

OK.  iirc there was quite a lot of churn and conflicts here :)

Please send a fixup against latest -linus?


* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  2025-06-04  0:16     ` Andrew Morton
@ 2025-06-04 13:55       ` Sourabh Jain
  0 siblings, 0 replies; 6+ messages in thread
From: Sourabh Jain @ 2025-06-04 13:55 UTC (permalink / raw)
  To: Andrew Morton, Max Kellermann
  Cc: song, joel.granados, dianders, cminyard, linux-kernel



On 04/06/25 05:46, Andrew Morton wrote:
> On Tue, 3 Jun 2025 22:09:30 +0530 Sourabh Jain <sourabhjain@linux.ibm.com> wrote:
>
>> Hello Andrew,
>>
>>> +#endif
>>> +
>>>    		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
>>>    		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
>>>    			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
>> It seems like this patch was not applied properly to the upstream tree.
>>
>> Out of the three hunks in this patch, only the first one is applied; the
>> second and third hunks are missing.
>>
>> commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
>> Author: Max Kellermann <max.kellermann@ionos.com>
>> Date:   Sun May 4 20:08:31 2025 +0200
>>
>>       kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
>>
>>       Expose a simple counter to userspace for monitoring tools.
> OK.  iirc there was quite a lot of churn and conflicts here :)
>
> Please send a fixup against latest -linus?

Sure, I will wait for a day or two to see if Max is interested in 
sending the fix-up patch. Otherwise, I will send it.

Thanks,
Sourabh Jain
