* [PATCH v2 0/2] sysfs: add counters for lockups and stalls
From: Max Kellermann @ 2025-05-04 18:08 UTC
To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
Cc: Max Kellermann

Commits 9db89b411170 ("exit: Expose "oops_count" to sysfs") and
8b05aa263361 ("panic: Expose "warn_count" to sysfs") added counters
for oopses and warnings to sysfs, and these two patches do the same
for hard/soft lockups and RCU stalls.

All of these counters are useful for monitoring tools to detect
whether the machine is healthy.  If the kernel has experienced a
lockup or a stall, it's probably due to a kernel bug, and I'd like to
detect that quickly and easily.  There is currently no way to detect
that, other than parsing dmesg or observing indirect effects, such as
certain tasks not responding; but then I need to observe all tasks,
and it may take a while until these effects become
visible/measurable.  I'd rather be able to detect the primary cause
more quickly, possibly before everything falls apart.

Max Kellermann (2):
  kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count

 .../ABI/testing/sysfs-kernel-hardlockup_count |  7 +++
 .../ABI/testing/sysfs-kernel-rcu_stall_count  |  6 +++
 .../ABI/testing/sysfs-kernel-softlockup_count |  7 +++
 kernel/rcu/tree_stall.h                       | 26 +++++++++
 kernel/watchdog.c                             | 53 +++++++++++++++++++
 5 files changed, 99 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-hardlockup_count
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-softlockup_count

-- 
2.47.2
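
For illustration only (not part of the series): a monitoring tool could
consume these counters roughly as sketched below.  This is a minimal
userspace sketch, assuming each file holds a single unsigned decimal
followed by a newline, as the "%u\n" format in the patches suggests.
The function names parse_counter() and read_counter() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text of one counter file ("<decimal>\n") into *out.
 * Returns 0 on success, -1 if the text is not a plain unsigned decimal
 * that fits in an unsigned int.
 */
int parse_counter(const char *text, unsigned int *out)
{
	char *end;
	unsigned long v;

	assert(text != NULL && out != NULL);

	/* Reject signs, leading whitespace and empty input up front. */
	if (text[0] < '0' || text[0] > '9')
		return -1;

	errno = 0;
	v = strtoul(text, &end, 10);
	if (errno != 0 || v > UINT_MAX)
		return -1;
	if (*end != '\n' && *end != '\0')
		return -1;

	*out = (unsigned int)v;
	return 0;
}

/*
 * Read one counter file, e.g. "/sys/kernel/softlockup_count".
 * Returns 0 on success, -1 if the file is missing (for example because
 * the detector is not configured) or its content is malformed.
 */
int read_counter(const char *path, unsigned int *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_counter(buf, out);
}
```

A caller would treat a -1 from read_counter() as "counter not
available" rather than as zero, since the lockup files exist only when
the corresponding detector is enabled.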
* [PATCH v2 1/2] kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
From: Max Kellermann @ 2025-05-04 18:08 UTC
To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
Cc: Max Kellermann

There is /proc/sys/kernel/hung_task_detect_count,
/sys/kernel/warn_count and /sys/kernel/oops_count but there is no
userspace-accessible counter for hard/soft lockups.  Having this is
useful for monitoring tools.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
v1 -> v2: added documentation; added patch set cover letter with
  justification
---
 .../ABI/testing/sysfs-kernel-hardlockup_count |  7 +++
 .../ABI/testing/sysfs-kernel-softlockup_count |  7 +++
 kernel/watchdog.c                             | 53 +++++++++++++++++++
 3 files changed, 67 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-hardlockup_count
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-softlockup_count

diff --git a/Documentation/ABI/testing/sysfs-kernel-hardlockup_count b/Documentation/ABI/testing/sysfs-kernel-hardlockup_count
new file mode 100644
index 000000000000..dfdd4078b077
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-hardlockup_count
@@ -0,0 +1,7 @@
+What:		/sys/kernel/hardlockup_count
+Date:		May 2025
+KernelVersion:	6.16
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		Shows how many times the system has detected a hard lockup since last boot.
+		Available only if CONFIG_HARDLOCKUP_DETECTOR is enabled.
diff --git a/Documentation/ABI/testing/sysfs-kernel-softlockup_count b/Documentation/ABI/testing/sysfs-kernel-softlockup_count
new file mode 100644
index 000000000000..337ff5531b5f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-softlockup_count
@@ -0,0 +1,7 @@
+What:		/sys/kernel/softlockup_count
+Date:		May 2025
+KernelVersion:	6.16
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		Shows how many times the system has detected a soft lockup since last boot.
+		Available only if CONFIG_SOFTLOCKUP_DETECTOR is enabled.
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 9fa2af9dbf2c..09994bfb47af 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -63,6 +63,29 @@ int __read_mostly sysctl_hardlockup_all_cpu_backtrace;
  */
 unsigned int __read_mostly hardlockup_panic =
 			IS_ENABLED(CONFIG_BOOTPARAM_HARDLOCKUP_PANIC);
+
+#ifdef CONFIG_SYSFS
+
+static unsigned int hardlockup_count;
+
+static ssize_t hardlockup_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+				     char *page)
+{
+	return sysfs_emit(page, "%u\n", hardlockup_count);
+}
+
+static struct kobj_attribute hardlockup_count_attr = __ATTR_RO(hardlockup_count);
+
+static __init int kernel_hardlockup_sysfs_init(void)
+{
+	sysfs_add_file_to_group(kernel_kobj, &hardlockup_count_attr.attr, NULL);
+	return 0;
+}
+
+late_initcall(kernel_hardlockup_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
 /*
  * We may not want to enable hard lockup detection by default in all cases,
  * for example when running the kernel as a guest on a hypervisor. In these
@@ -169,6 +192,10 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 		unsigned int this_cpu = smp_processor_id();
 		unsigned long flags;
 
+#ifdef CONFIG_SYSFS
+		++hardlockup_count;
+#endif
+
 		/* Only print hardlockups once. */
 		if (per_cpu(watchdog_hardlockup_warned, cpu))
 			return;
@@ -311,6 +338,28 @@ unsigned int __read_mostly softlockup_panic =
 static bool softlockup_initialized __read_mostly;
 static u64 __read_mostly sample_period;
 
+#ifdef CONFIG_SYSFS
+
+static unsigned int softlockup_count;
+
+static ssize_t softlockup_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+				     char *page)
+{
+	return sysfs_emit(page, "%u\n", softlockup_count);
+}
+
+static struct kobj_attribute softlockup_count_attr = __ATTR_RO(softlockup_count);
+
+static __init int kernel_softlockup_sysfs_init(void)
+{
+	sysfs_add_file_to_group(kernel_kobj, &softlockup_count_attr.attr, NULL);
+	return 0;
+}
+
+late_initcall(kernel_softlockup_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
 /* Timestamp taken after the last successful reschedule. */
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 /* Timestamp of the last softlockup report. */
@@ -742,6 +791,10 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	touch_ts = __this_cpu_read(watchdog_touch_ts);
 	duration = is_softlockup(touch_ts, period_ts, now);
 	if (unlikely(duration)) {
+#ifdef CONFIG_SYSFS
+		++softlockup_count;
+#endif
+
 		/*
 		 * Prevent multiple soft-lockup reports if one cpu is already
 		 * engaged in dumping all cpu back traces.
-- 
2.47.2
* [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
From: Max Kellermann @ 2025-05-04 18:08 UTC
To: akpm, song, joel.granados, dianders, cminyard, linux-kernel
Cc: Max Kellermann

Exposing a simple counter to userspace for monitoring tools.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
v1 -> v2: added documentation
---
 .../ABI/testing/sysfs-kernel-rcu_stall_count  |  6 +++++
 kernel/rcu/tree_stall.h                       | 26 +++++++++++++++++++
 2 files changed, 32 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count

diff --git a/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
new file mode 100644
index 000000000000..a4a97a7f4a4d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
@@ -0,0 +1,6 @@
+What:		/sys/kernel/rcu_stall_count
+Date:		May 2025
+KernelVersion:	6.16
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+		Shows how many times the system has detected an RCU stall since last boot.
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 925fcdad5dea..158330524795 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -20,6 +20,28 @@
 int sysctl_panic_on_rcu_stall __read_mostly;
 int sysctl_max_rcu_stall_to_panic __read_mostly;
 
+#ifdef CONFIG_SYSFS
+
+static unsigned int rcu_stall_count;
+
+static ssize_t rcu_stall_count_show(struct kobject *kobj, struct kobj_attribute *attr,
+				    char *page)
+{
+	return sysfs_emit(page, "%u\n", rcu_stall_count);
+}
+
+static struct kobj_attribute rcu_stall_count_attr = __ATTR_RO(rcu_stall_count);
+
+static __init int kernel_rcu_stall_sysfs_init(void)
+{
+	sysfs_add_file_to_group(kernel_kobj, &rcu_stall_count_attr.attr, NULL);
+	return 0;
+}
+
+late_initcall(kernel_rcu_stall_sysfs_init);
+
+#endif // CONFIG_SYSFS
+
 #ifdef CONFIG_PROVE_RCU
 #define RCU_STALL_DELAY_DELTA		(5 * HZ)
 #else
@@ -784,6 +806,10 @@ static void check_cpu_stall(struct rcu_data *rdp)
 		if (kvm_check_and_clear_guest_paused())
 			return;
 
+#ifdef CONFIG_SYSFS
+		++rcu_stall_count;
+#endif
+
 		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
 		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
 			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
-- 
2.47.2
* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
From: Sourabh Jain @ 2025-06-03 16:39 UTC
To: akpm, Max Kellermann, song, joel.granados, dianders, cminyard, linux-kernel

Hello Andrew,

On 04/05/25 23:38, Max Kellermann wrote:
> Exposing a simple counter to userspace for monitoring tools.
>
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
> v1 -> v2: added documentation
> ---
>   .../ABI/testing/sysfs-kernel-rcu_stall_count  |  6 +++++
>   kernel/rcu/tree_stall.h                       | 26 +++++++++++++++++++
>   2 files changed, 32 insertions(+)
>   create mode 100644 Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
> new file mode 100644
> index 000000000000..a4a97a7f4a4d
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-rcu_stall_count
> @@ -0,0 +1,6 @@
> +What:		/sys/kernel/rcu_stall_count
> +Date:		May 2025
> +KernelVersion:	6.16
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:
> +		Shows how many times the system has detected an RCU stall since last boot.
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 925fcdad5dea..158330524795 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -20,6 +20,28 @@
>   int sysctl_panic_on_rcu_stall __read_mostly;
>   int sysctl_max_rcu_stall_to_panic __read_mostly;
>
> +#ifdef CONFIG_SYSFS
> +
> +static unsigned int rcu_stall_count;
> +
> +static ssize_t rcu_stall_count_show(struct kobject *kobj, struct kobj_attribute *attr,
> +				    char *page)
> +{
> +	return sysfs_emit(page, "%u\n", rcu_stall_count);
> +}
> +
> +static struct kobj_attribute rcu_stall_count_attr = __ATTR_RO(rcu_stall_count);
> +
> +static __init int kernel_rcu_stall_sysfs_init(void)
> +{
> +	sysfs_add_file_to_group(kernel_kobj, &rcu_stall_count_attr.attr, NULL);
> +	return 0;
> +}
> +
> +late_initcall(kernel_rcu_stall_sysfs_init);
> +
> +#endif // CONFIG_SYSFS
> +
>   #ifdef CONFIG_PROVE_RCU
>   #define RCU_STALL_DELAY_DELTA		(5 * HZ)
>   #else
> @@ -784,6 +806,10 @@ static void check_cpu_stall(struct rcu_data *rdp)
>   		if (kvm_check_and_clear_guest_paused())
>   			return;
>
> +#ifdef CONFIG_SYSFS
> +		++rcu_stall_count;
> +#endif
> +
>   		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
>   		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
>   			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);

It seems like this patch was not applied properly to the upstream tree.

Out of the three hunks in this patch, only the first one is applied;
the second and third hunks are missing.

commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
Author: Max Kellermann <max.kellermann@ionos.com>
Date:   Sun May 4 20:08:31 2025 +0200

    kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count

    Expose a simple counter to userspace for monitoring tools.

Thanks,
Sourabh Jain
* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
From: Andrew Morton @ 2025-06-04 0:16 UTC
To: Sourabh Jain
Cc: Max Kellermann, song, joel.granados, dianders, cminyard, linux-kernel

On Tue, 3 Jun 2025 22:09:30 +0530 Sourabh Jain <sourabhjain@linux.ibm.com> wrote:

> Hello Andrew,
>
> > +#endif
> > +
> >   		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
> >   		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
> >   			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
>
> It seems like this patch was not applied properly to the upstream tree.
>
> Out of the three hunks in this patch, only the first one is applied; the
> second
> and third hunks are missing.
>
> commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
> Author: Max Kellermann <max.kellermann@ionos.com>
> Date:   Sun May 4 20:08:31 2025 +0200
>
>     kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
>
>     Expose a simple counter to userspace for monitoring tools.

OK.  iirc there was quite a lot of churn and conflicts here :)

Please send a fixup against latest -linus?
* Re: [PATCH v2 2/2] kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
From: Sourabh Jain @ 2025-06-04 13:55 UTC
To: Andrew Morton, Max Kellermann
Cc: song, joel.granados, dianders, cminyard, linux-kernel

On 04/06/25 05:46, Andrew Morton wrote:
> On Tue, 3 Jun 2025 22:09:30 +0530 Sourabh Jain <sourabhjain@linux.ibm.com> wrote:
>
>> Hello Andrew,
>>
>>> +#endif
>>> +
>>>   		rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_NORM, (void *)j - gps);
>>>   		if (READ_ONCE(csd_lock_suppress_rcu_stall) && csd_lock_is_stuck()) {
>>>   			pr_err("INFO: %s detected stall, but suppressed full report due to a stuck CSD-lock.\n", rcu_state.name);
>> It seems like this patch was not applied properly to the upstream tree.
>>
>> Out of the three hunks in this patch, only the first one is applied; the
>> second
>> and third hunks are missing.
>>
>> commit 2536c5c7d6ae5e1d844aa21f28b326b5e7f815ef
>> Author: Max Kellermann <max.kellermann@ionos.com>
>> Date:   Sun May 4 20:08:31 2025 +0200
>>
>>     kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
>>
>>     Expose a simple counter to userspace for monitoring tools.
> OK.  iirc there was quite a lot of churn and conflicts here :)
>
> Please send a fixup against latest -linus?

Sure, I will wait for a day or two to see if Max is interested in
sending the fix-up patch.  Otherwise, I will send it.

Thanks,
Sourabh Jain