public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] *** Detect interrupt storm in softlockup ***
@ 2024-01-23 12:12 Bitao Hu
  2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Bitao Hu @ 2024-01-23 12:12 UTC (permalink / raw)
  To: dianders, akpm, pmladek, tglx, maz, liusong; +Cc: linux-kernel, Bitao Hu

Hi guys,
    I have previously encountered an issue where an NVMe interrupt
storm caused a softlockup, but the call tree did not provide useful
information. This is because the call tree is merely a snapshot and
does not fully reflect the CPU's state over the duration of the
softlockup_thresh period. Consequently, I think that reporting CPU
utilization (system, softirq, hardirq, idle) during a softlockup would
be beneficial for identifying issues related to interrupt storms, as
well as assisting in the analysis of other causes of softlockup.
    Furthermore, reporting the most time-consuming hardirqs during a
softlockup could directly pinpoint which interrupt is responsible
for the issue.

Bitao Hu (3):
  watchdog/softlockup: low-overhead detection of interrupt storm
  watchdog/softlockup: report the most time-consuming hardirq
  watchdog/softlockup: add parameter to control the reporting of
    time-consuming hardirq

 include/linux/irq.h     |   9 ++
 include/linux/irqdesc.h |   2 +
 kernel/irq/irqdesc.c    |   9 +-
 kernel/watchdog.c       | 289 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 308 insertions(+), 1 deletion(-)

-- 
2.37.1 (Apple Git-137.1)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-23 12:12 [PATCH 0/3] *** Detect interrupt storm in softlockup *** Bitao Hu
@ 2024-01-23 12:12 ` Bitao Hu
  2024-01-24  1:43   ` Liu Song
                     ` (2 more replies)
  2024-01-23 12:12 ` [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq Bitao Hu
  2024-01-23 12:12 ` [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of " Bitao Hu
  2 siblings, 3 replies; 18+ messages in thread
From: Bitao Hu @ 2024-01-23 12:12 UTC (permalink / raw)
  To: dianders, akpm, pmladek, tglx, maz, liusong; +Cc: linux-kernel, Bitao Hu

The following softlockup is caused by an interrupt storm, but the storm
cannot be identified from the call tree, because the call tree is merely
a snapshot and does not fully capture the behavior of the CPU during the
soft lockup.
  watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
  ...
  Call trace:
    __do_softirq+0xa0/0x37c
    __irq_exit_rcu+0x108/0x140
    irq_exit+0x14/0x20
    __handle_domain_irq+0x84/0xe0
    gic_handle_irq+0x80/0x108
    el0_irq_naked+0x50/0x58

Therefore, I think it is necessary to report CPU utilization during the
softlockup_thresh period (reported once every sample_period, for a total
of 5 reports), like this:
  watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
  CPU#28 Utilization every 4s during lockup:
    #1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
    #2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
    #3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
    #4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
    #5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
  ...

This would be helpful in determining whether an interrupt storm has
occurred or in identifying the cause of the softlockup. The criteria for
determination are as follows:
  a. If the hardirq utilization is high, an interrupt storm should be
  considered, and the root cause cannot be determined from the call tree.
  b. If the softirq utilization is high, we can analyze the call
  tree, but it may not reveal the root cause.
  c. If the system utilization is high, we can analyze the root
  cause from the call tree.

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
---
 kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 81a8862295d6..9fad10e0a147 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -23,6 +23,8 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/isolation.h>
 #include <linux/stop_machine.h>
+#include <linux/kernel_stat.h>
+#include <linux/math64.h>
 
 #include <asm/irq_regs.h>
 #include <linux/kvm_para.h>
@@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
 	return 0;
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
+static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
+static DEFINE_PER_CPU(int, cpustat_tail);
+
+static void update_cpustat(void)
+{
+	u64 *old = this_cpu_ptr(cpustat_old);
+	u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
+	int tail = this_cpu_read(cpustat_tail), i;
+	struct kernel_cpustat kcpustat;
+	u64 *cpustat = kcpustat.cpustat;
+
+	kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
+	for (i = 0; i < NR_STATS; i++) {
+		diff[tail][i] = cpustat[i] - old[i];
+		old[i] = cpustat[i];
+	}
+	this_cpu_write(cpustat_tail, (tail + 1) % 5);
+}
+
+static void print_cpustat(void)
+{
+	int i, j, k;
+	u64 a[5][NR_STATS], b[5][NR_STATS];
+	u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
+	int tail = this_cpu_read(cpustat_tail);
+	u32 period_us = sample_period / 1000;
+
+	for (i = 0; i < 5; i++) {
+		for (j = 0; j < NR_STATS; j++) {
+			a[i][j] = 100 * (diff[i][j] / 1000);
+			b[i][j] = 10 * do_div(a[i][j], period_us);
+			do_div(b[i][j], period_us);
+		}
+	}
+	printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
+		smp_processor_id(), period_us/1000000);
+	for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
+		printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
+			"%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
+			a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
+			a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
+			a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
+			a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);
+	}
+}
+#else
+static inline void update_cpustat(void) { }
+static inline void print_cpustat(void) { }
+#endif
+
 /* watchdog detector functions */
 static DEFINE_PER_CPU(struct completion, softlockup_completion);
 static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);
@@ -504,6 +558,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 */
 	period_ts = READ_ONCE(*this_cpu_ptr(&watchdog_report_ts));
 
+	/* update cpu usage stat */
+	update_cpustat();
+
 	/* Reset the interval when touched by known problematic code. */
 	if (period_ts == SOFTLOCKUP_DELAY_REPORT) {
 		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
@@ -539,6 +596,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
+		print_cpustat();
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)
-- 
2.37.1 (Apple Git-137.1)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-01-23 12:12 [PATCH 0/3] *** Detect interrupt storm in softlockup *** Bitao Hu
  2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
@ 2024-01-23 12:12 ` Bitao Hu
  2024-01-25  0:19   ` Doug Anderson
  2024-02-14 23:36   ` Thomas Gleixner
  2024-01-23 12:12 ` [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of " Bitao Hu
  2 siblings, 2 replies; 18+ messages in thread
From: Bitao Hu @ 2024-01-23 12:12 UTC (permalink / raw)
  To: dianders, akpm, pmladek, tglx, maz, liusong; +Cc: linux-kernel, Bitao Hu

When the watchdog determines that the current soft lockup is due
to an interrupt storm based on CPU utilization, reporting the
top three most time-consuming hardirqs can be quite useful for
further troubleshooting.

Below is an example of an interrupt storm. The call tree does not
provide useful information, but we can identify which interrupt
caused the soft lockup from the per-hardirq time accounting.

[   67.714044] watchdog: BUG: soft lockup - CPU#9 stuck for 28s! [swapper/9:0]
[   67.714548] CPU#9 Utilization every 4s during lockup:
[   67.714549]  #1: 0.0% system,        0.0% softirq,   95.0% hardirq,  0.0% idle
[   67.714551]  #2: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
[   67.714553]  #3: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
[   67.714555]  #4: 0.0% system,        0.0% softirq,   95.0% hardirq,  0.0% idle
[   67.714556]  #5: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
[   67.714570] CPU#9 Detect HardIRQ Time exceeds 50% since 45s. Most time-consuming HardIRQs:
[   67.714571]  #1: 99.9% irq#7(IPI)
[   67.714575]  #2: 0.0% irq#10(arch_timer)
[   67.714578]  #3: 0.0% irq#2(IPI)
...
[   67.714654] Call trace:
[   67.714656]  __do_softirq+0xa8/0x364

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
---
 include/linux/irq.h     |   9 ++
 include/linux/irqdesc.h |   2 +
 kernel/irq/irqdesc.c    |   9 +-
 kernel/watchdog.c       | 186 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 90081afa10ce..3a3e8b1e227e 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -251,6 +251,15 @@ enum {
 	IRQD_RESEND_WHEN_IN_PROGRESS    = BIT(30),
 };
 
+
+/*
+ * struct per_irqtime - per irq stats for interrupt storm detection
+ */
+struct per_irqtime {
+	u64	start_time;
+	u64	delta;
+};
+
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
 
 static inline bool irqd_is_setaffinity_pending(struct irq_data *d)
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index d9451d456a73..c5407cdcf2d6 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -29,6 +29,7 @@ struct pt_regs;
  * @wake_depth:		enable depth, for multiple irq_set_irq_wake() callers
  * @tot_count:		stats field for non-percpu irqs
  * @irq_count:		stats field to detect stalled irqs
+ * @irq_time:		stats field to detect interrupt storm
  * @last_unhandled:	aging timer for unhandled count
  * @irqs_unhandled:	stats field for spurious unhandled interrupts
  * @threads_handled:	stats field for deferred spurious detection of threaded handlers
@@ -64,6 +65,7 @@ struct irq_desc {
 	unsigned int		wake_depth;	/* nested wake enables */
 	unsigned int		tot_count;
 	unsigned int		irq_count;	/* For detecting broken IRQs */
+	struct per_irqtime __percpu *irq_time;
 	unsigned long		last_unhandled;	/* Aging timer for unhandled count */
 	unsigned int		irqs_unhandled;
 	atomic_t		threads_handled;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 27ca1c866f29..a4d7c324cba2 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -422,9 +422,12 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
 	desc->kstat_irqs = alloc_percpu(unsigned int);
 	if (!desc->kstat_irqs)
 		goto err_desc;
+	desc->irq_time = alloc_percpu(struct per_irqtime);
+	if (!desc->irq_time)
+		goto err_kstat;
 
 	if (alloc_masks(desc, node))
-		goto err_kstat;
+		goto err_irqtime;
 
 	raw_spin_lock_init(&desc->lock);
 	lockdep_set_class(&desc->lock, &irq_desc_lock_class);
@@ -439,6 +442,8 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
 
 	return desc;
 
+err_irqtime:
+	free_percpu(desc->irq_time);
 err_kstat:
 	free_percpu(desc->kstat_irqs);
 err_desc:
@@ -451,6 +456,7 @@ static void irq_kobj_release(struct kobject *kobj)
 	struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
 
 	free_masks(desc);
+	free_percpu(desc->irq_time);
 	free_percpu(desc->kstat_irqs);
 	kfree(desc);
 }
@@ -594,6 +600,7 @@ int __init early_irq_init(void)
 
 	for (i = 0; i < count; i++) {
 		desc[i].kstat_irqs = alloc_percpu(unsigned int);
+		desc[i].irq_time = alloc_percpu(struct per_irqtime);
 		alloc_masks(&desc[i], node);
 		raw_spin_lock_init(&desc[i].lock);
 		lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 9fad10e0a147..f347c5d8c5c1 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -25,6 +25,12 @@
 #include <linux/stop_machine.h>
 #include <linux/kernel_stat.h>
 #include <linux/math64.h>
+#include <linux/tracepoint.h>
+#include <linux/irq.h>
+#include <linux/bitops.h>
+#include <trace/events/irq.h>
+#include <linux/irqdesc.h>
+#include <linux/timekeeping.h>
 
 #include <asm/irq_regs.h>
 #include <linux/kvm_para.h>
@@ -431,11 +437,15 @@ void touch_softlockup_watchdog_sync(void)
 	__this_cpu_write(watchdog_report_ts, SOFTLOCKUP_DELAY_REPORT);
 }
 
+static void set_potential_softlockup(unsigned long now, unsigned long touch_ts);
+
 static int is_softlockup(unsigned long touch_ts,
 			 unsigned long period_ts,
 			 unsigned long now)
 {
 	if ((watchdog_enabled & WATCHDOG_SOFTOCKUP_ENABLED) && watchdog_thresh) {
+		/* Softlockup may occur in the current period */
+		set_potential_softlockup(now, period_ts);
 		/* Warn about unreasonable delays. */
 		if (time_after(now, period_ts + get_softlockup_thresh()))
 			return now - touch_ts;
@@ -448,6 +458,8 @@ static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
 static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
 static DEFINE_PER_CPU(int, cpustat_tail);
 
+static void print_hardirq_time(void);
+
 static void update_cpustat(void)
 {
 	u64 *old = this_cpu_ptr(cpustat_old);
@@ -489,10 +501,179 @@ static void print_cpustat(void)
 			a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
 			a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);
 	}
+	print_hardirq_time();
+}
+
+static DEFINE_PER_CPU(u64, hardirq_start_time);
+
+#define SOFTLOCKUP_HARDIRQ	0
+static DEFINE_PER_CPU(long, softlockup_flags);
+
+static void find_irqtime_top3(u64 *irqtime, int *irq, u64 perirq_time, int perirq_id)
+{
+	if (perirq_time > irqtime[0] || (perirq_time == irqtime[0] && perirq_id < irq[0])) {
+		irqtime[2] = irqtime[1];
+		irq[2] = irq[1];
+		irqtime[1] = irqtime[0];
+		irq[1] = irq[0];
+		irqtime[0] = perirq_time;
+		irq[0] = perirq_id;
+	} else if (perirq_time > irqtime[1] || (perirq_time == irqtime[1] && perirq_id < irq[1])) {
+		irqtime[2] = irqtime[1];
+		irq[2] = irq[1];
+		irqtime[1] = perirq_time;
+		irq[1] = perirq_id;
+	} else if (perirq_time > irqtime[2] || (perirq_time == irqtime[2] && perirq_id < irq[2])) {
+		irqtime[2] = perirq_time;
+		irq[2] = perirq_id;
+	}
+}
+
+/*
+ * If the proportion of time spent handling irq exceeds 50% during a sampling period,
+ * then it is necessary to tally the handling time of each irq.
+ */
+static inline bool need_trace_irqtime(int type)
+{
+	int tail = this_cpu_read(cpustat_tail);
+
+	if (--tail == -1)
+		tail = 4;
+	return this_cpu_read(cpustat_diff[tail][type]) > sample_period/2;
+}
+
+static void irq_handler_entry_callback(void *data, int irq, struct irqaction *action)
+{
+	this_cpu_ptr(irq_to_desc(irq)->irq_time)->start_time = local_clock();
+}
+
+static void irq_handler_exit_callback(void *data, int irq, struct irqaction *action, int ret)
+{
+	u64 delta;
+	struct per_irqtime *irq_time;
+
+	if (test_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags))) {
+		irq_time = this_cpu_ptr(irq_to_desc(irq)->irq_time);
+		delta = local_clock() - irq_time->start_time;
+		irq_time->delta += delta;
+	}
+}
+/*
+ * Mark softlockup as potentially caused by hardirq
+ */
+static void set_potential_softlockup_hardirq(void)
+{
+	if (!need_trace_irqtime(CPUTIME_IRQ))
+		return;
+
+	if (test_and_set_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags)))
+		return;
+
+	__this_cpu_write(hardirq_start_time, local_clock());
+}
+
+static void clear_potential_softlockup_hardirq(void)
+{
+	unsigned int i;
+	unsigned long flags;
+	struct irq_desc *desc;
+
+	local_irq_save(flags);
+	clear_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags));
+	for_each_irq_desc(i, desc) {
+		if (!desc)
+			continue;
+		memset(this_cpu_ptr(desc->irq_time), 0, sizeof(struct per_irqtime));
+	}
+	local_irq_restore(flags);
+}
+
+static void hook_hardirq_events(void)
+{
+	int ret;
+
+	ret = register_trace_irq_handler_entry(irq_handler_entry_callback, NULL);
+	if (ret)
+		goto out_err;
+	ret = register_trace_irq_handler_exit(irq_handler_exit_callback, NULL);
+	if (ret)
+		goto out_unreg_entry;
+	return;
+
+out_unreg_entry:
+	unregister_trace_irq_handler_entry(irq_handler_entry_callback, NULL);
+out_err:
+	return;
+}
+
+static void unhook_hardirq_events(void)
+{
+	unregister_trace_irq_handler_entry(irq_handler_entry_callback, NULL);
+	unregister_trace_irq_handler_exit(irq_handler_exit_callback, NULL);
 }
+
+/*
+ * Mark that softlockup may occur
+ */
+static void set_potential_softlockup(unsigned long now, unsigned long period_ts)
+{
+	if (time_after_eq(now, period_ts + get_softlockup_thresh()/5))
+		set_potential_softlockup_hardirq();
+}
+
+static void clear_potential_softlockup(void)
+{
+	clear_potential_softlockup_hardirq();
+}
+
+static void print_hardirq_time(void)
+{
+	struct per_irqtime *irq_time;
+	struct irq_desc *desc;
+	u64 hardirqtime_top3[3] = {0, 0, 0};
+	int hardirq_top3[3] = {-1, -1, -1};
+	u64 start_time, now, a;
+	u32 period_us, i, b;
+
+	if (test_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags))) {
+		start_time = __this_cpu_read(hardirq_start_time);
+		now = local_clock();
+		period_us = (now - start_time)/1000;
+		/* Find the top three most time-consuming interrupts */
+		for_each_irq_desc(i, desc) {
+			if (!desc)
+				continue;
+			irq_time = this_cpu_ptr(desc->irq_time);
+			find_irqtime_top3(hardirqtime_top3, hardirq_top3, irq_time->delta, i);
+		}
+		printk(KERN_CRIT "CPU#%d Detect HardIRQ Time exceeds 50%% since %llds. Most time-consuming HardIRQs:\n",
+			smp_processor_id(), ns_to_timespec64(start_time - sample_period).tv_sec);
+		for (i = 0; i < 3; i++) {
+			if (hardirq_top3[i] == -1)
+				break;
+			a = 100 * (hardirqtime_top3[i] / 1000);
+			b = 10 * do_div(a, period_us);
+			do_div(b, period_us);
+			desc = irq_to_desc(hardirq_top3[i]);
+			if (desc && desc->action)
+				printk(KERN_CRIT "\t#%u: %llu.%u%% irq#%d(%s)\n",
+					i+1, a, b, hardirq_top3[i], desc->action->name);
+			else
+				printk(KERN_CRIT "\t#%u: %llu.%u%% irq#%d\n",
+				i+1, a, b, hardirq_top3[i]);
+		}
+		if (!need_trace_irqtime(CPUTIME_IRQ))
+			clear_potential_softlockup_hardirq();
+	}
+}
+
 #else
 static inline void update_cpustat(void) { }
 static inline void print_cpustat(void) { }
+static inline void set_potential_softlockup(unsigned long now, unsigned long period_ts) { }
+static inline void clear_potential_softlockup(void) { }
+static inline void hook_hardirq_events(void) { }
+static inline void unhook_hardirq_events(void) { }
 #endif
 
 /* watchdog detector functions */
@@ -510,6 +691,7 @@ static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);
 static int softlockup_fn(void *data)
 {
 	update_touch_ts();
+	clear_potential_softlockup();
 	complete(this_cpu_ptr(&softlockup_completion));
 
 	return 0;
@@ -674,6 +856,8 @@ static void softlockup_stop_all(void)
 	if (!softlockup_initialized)
 		return;
 
+	unhook_hardirq_events();
+
 	for_each_cpu(cpu, &watchdog_allowed_mask)
 		smp_call_on_cpu(cpu, softlockup_stop_fn, NULL, false);
 
@@ -690,6 +874,8 @@ static void softlockup_start_all(void)
 {
 	int cpu;
 
+	hook_hardirq_events();
+
 	cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
 	for_each_cpu(cpu, &watchdog_allowed_mask)
 		smp_call_on_cpu(cpu, softlockup_start_fn, NULL, false);
-- 
2.37.1 (Apple Git-137.1)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of time-consuming hardirq
  2024-01-23 12:12 [PATCH 0/3] *** Detect interrupt storm in softlockup *** Bitao Hu
  2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
  2024-01-23 12:12 ` [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq Bitao Hu
@ 2024-01-23 12:12 ` Bitao Hu
  2024-01-26  5:25   ` kernel test robot
  2024-01-26  6:07   ` kernel test robot
  2 siblings, 2 replies; 18+ messages in thread
From: Bitao Hu @ 2024-01-23 12:12 UTC (permalink / raw)
  To: dianders, akpm, pmladek, tglx, maz, liusong; +Cc: linux-kernel, Bitao Hu

To determine the cause of a softlockup more accurately, we use
tracepoints to measure the time spent in each hardirq handler, which
may have some impact on performance. Add a parameter so that users
can enable this feature on demand.

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
---
 kernel/watchdog.c | 51 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index f347c5d8c5c1..314dfd301d8c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -303,6 +303,9 @@ unsigned int __read_mostly softlockup_panic =
 static bool softlockup_initialized __read_mostly;
 static u64 __read_mostly sample_period;
 
+static int __read_mostly softlockup_irqtrace;
+static bool softlockup_irqtrace_initialized __read_mostly;
+
 /* Timestamp taken after the last successful reschedule. */
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 /* Timestamp of the last softlockup report. */
@@ -318,6 +321,13 @@ static int __init softlockup_panic_setup(char *str)
 }
 __setup("softlockup_panic=", softlockup_panic_setup);
 
+static int __init softlockup_irqtrace_setup(char *str)
+{
+	get_option(&str, &softlockup_irqtrace);
+	return 1;
+}
+__setup("softlockup_irqtrace=", softlockup_irqtrace_setup);
+
 static int __init nowatchdog_setup(char *str)
 {
 	watchdog_user_enabled = 0;
@@ -635,7 +645,7 @@ static void print_hardirq_time(void)
 	u64 start_time, now, a;
 	u32 period_us, i, b;
 
-	if (test_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags))) {
+	if (softlockup_irqtrace && test_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags))) {
 		start_time = __this_cpu_read(hardirq_start_time);
 		now = local_clock();
 		period_us = (now - start_time)/1000;
@@ -856,7 +866,10 @@ static void softlockup_stop_all(void)
 	if (!softlockup_initialized)
 		return;
 
-	unhook_hardirq_events();
+	if (softlockup_irqtrace_initialized) {
+		unhook_hardirq_events();
+		softlockup_irqtrace_initialized = false;
+	}
 
 	for_each_cpu(cpu, &watchdog_allowed_mask)
 		smp_call_on_cpu(cpu, softlockup_stop_fn, NULL, false);
@@ -874,7 +887,10 @@ static void softlockup_start_all(void)
 {
 	int cpu;
 
-	hook_hardirq_events();
+	if (softlockup_irqtrace && !softlockup_irqtrace_initialized) {
+		hook_hardirq_events();
+		softlockup_irqtrace_initialized = true;
+	}
 
 	cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
 	for_each_cpu(cpu, &watchdog_allowed_mask)
@@ -1090,6 +1106,26 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 	return err;
 }
 
+/*
+ * /proc/sys/kernel/softlockup_irqtrace
+ */
+int proc_softlockup_irqtrace(struct ctl_table *table, int write,
+			 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err, old;
+
+	mutex_lock(&watchdog_mutex);
+
+	old = READ_ONCE(softlockup_irqtrace);
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (!err && write && old != READ_ONCE(softlockup_irqtrace))
+		proc_watchdog_update();
+
+	mutex_unlock(&watchdog_mutex);
+	return err;
+}
+
 /*
  * The cpumask is the mask of possible cpus that the watchdog can run
  * on, not the mask of cpus it is actually running on.  This allows the
@@ -1158,6 +1194,15 @@ static struct ctl_table watchdog_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "softlockup_irqtrace",
+		.data		= &softlockup_irqtrace,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_softlockup_irqtrace,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #ifdef CONFIG_SMP
 	{
 		.procname	= "softlockup_all_cpu_backtrace",
-- 
2.37.1 (Apple Git-137.1)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
@ 2024-01-24  1:43   ` Liu Song
  2024-01-24  2:48     ` yaoma
  2024-01-25  0:19   ` Doug Anderson
  2024-01-26  5:25   ` kernel test robot
  2 siblings, 1 reply; 18+ messages in thread
From: Liu Song @ 2024-01-24  1:43 UTC (permalink / raw)
  To: Bitao Hu, dianders, akpm, pmladek, tglx, maz; +Cc: linux-kernel


On 2024/1/23 20:12, Bitao Hu wrote:
> The following softlockup is caused by interrupt storm, but it cannot be
> identified from the call tree. Because the call tree is just a snapshot
> and doesn't fully capture the behavior of the CPU during the soft lockup.
>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>    ...
>    Call trace:
>      __do_softirq+0xa0/0x37c
>      __irq_exit_rcu+0x108/0x140
>      irq_exit+0x14/0x20
>      __handle_domain_irq+0x84/0xe0
>      gic_handle_irq+0x80/0x108
>      el0_irq_naked+0x50/0x58
>
> Therefore,I think it is necessary to report CPU utilization during the
> softlockup_thresh period (report once every sample_period, for a total
> of 5 reportings), like this:
>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>    CPU#28 Utilization every 4s during lockup:
>      #1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>      #2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>      #3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>      #4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>      #5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>    ...
>
> This would be helpful in determining whether an interrupt storm has
> occurred or in identifying the cause of the softlockup. The criteria for
> determination are as follows:
>    a. If the hardirq utilization is high, then interrupt storm should be
>    considered and the root cause cannot be determined from the call tree.
>    b. If the softirq utilization is high, then we could analyze the call
>    tree but it may cannot reflect the root cause.
>    c. If the system utilization is high, then we could analyze the root
>    cause from the call tree.
>
> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
> ---
>   kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 58 insertions(+)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 81a8862295d6..9fad10e0a147 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -23,6 +23,8 @@
>   #include <linux/sched/debug.h>
>   #include <linux/sched/isolation.h>
>   #include <linux/stop_machine.h>
> +#include <linux/kernel_stat.h>
> +#include <linux/math64.h>
>   
>   #include <asm/irq_regs.h>
>   #include <linux/kvm_para.h>
> @@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
>   	return 0;
>   }
>   
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
> +static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
> +static DEFINE_PER_CPU(int, cpustat_tail);
> +
> +static void update_cpustat(void)
> +{
> +	u64 *old = this_cpu_ptr(cpustat_old);
> +	u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
> +	int tail = this_cpu_read(cpustat_tail), i;
> +	struct kernel_cpustat kcpustat;
> +	u64 *cpustat = kcpustat.cpustat;
> +
> +	kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
> +	for (i = 0; i < NR_STATS; i++) {
> +		diff[tail][i] = cpustat[i] - old[i];
> +		old[i] = cpustat[i];
> +	}
> +	this_cpu_write(cpustat_tail, (tail + 1) % 5);
The number 5 here is tied to the 5 in cpustat_diff[5]; it would be
better to define a macro instead of using the literal number 5
directly.
> +}
> +
> +static void print_cpustat(void)
> +{
> +	int i, j, k;
> +	u64 a[5][NR_STATS], b[5][NR_STATS];
Use a #define instead of the literal number 5.
> +	u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
> +	int tail = this_cpu_read(cpustat_tail);
> +	u32 period_us = sample_period / 1000;
Use NSEC_PER_USEC
> +
> +	for (i = 0; i < 5; i++) {
> +		for (j = 0; j < NR_STATS; j++) {
> +			a[i][j] = 100 * (diff[i][j] / 1000);
> +			b[i][j] = 10 * do_div(a[i][j], period_us);
> +			do_div(b[i][j], period_us);
> +		}
> +	}
> +	printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
better use "pr_crit", and was the original intent here microseconds (us) 
or milliseconds (ms)?
> +		smp_processor_id(), period_us/1000000);
better use "period_us /NSEC_PER_MSEC"?
> +	for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {

It seems that only i and j are necessary, k is not essential.

> +		printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
> +			"%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
> +			a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
> +			a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
> +			a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
> +			a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);
> +	}
> +}
> +#else
> +static inline void update_cpustat(void) { }
> +static inline void print_cpustat(void) { }
> +#endif
> +
>   /* watchdog detector functions */
>   static DEFINE_PER_CPU(struct completion, softlockup_completion);
>   static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);
> @@ -504,6 +558,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>   	 */
>   	period_ts = READ_ONCE(*this_cpu_ptr(&watchdog_report_ts));
>   
> +	/* update cpu usage stat */
The function name already indicates that it updates CPU usage
statistics, so the comment here appears superfluous.
If a comment is absolutely necessary, please provide more detailed
information.
> +	update_cpustat();
> +
>   	/* Reset the interval when touched by known problematic code. */
>   	if (period_ts == SOFTLOCKUP_DELAY_REPORT) {
>   		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
> @@ -539,6 +596,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>   		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>   			smp_processor_id(), duration,
>   			current->comm, task_pid_nr(current));
> +		print_cpustat();
>   		print_modules();
>   		print_irqtrace_events(current);
>   		if (regs)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-24  1:43   ` Liu Song
@ 2024-01-24  2:48     ` yaoma
  0 siblings, 0 replies; 18+ messages in thread
From: yaoma @ 2024-01-24  2:48 UTC (permalink / raw)
  To: Liu Song, dianders, akpm, pmladek, tglx, maz; +Cc: linux-kernel



On 2024/1/24 09:43, Liu Song wrote:
> 
> On 2024/1/23 20:12, Bitao Hu wrote:
>> The following softlockup is caused by interrupt storm, but it cannot be
>> identified from the call tree. Because the call tree is just a snapshot
>> and doesn't fully capture the behavior of the CPU during the soft lockup.
>>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>>    ...
>>    Call trace:
>>      __do_softirq+0xa0/0x37c
>>      __irq_exit_rcu+0x108/0x140
>>      irq_exit+0x14/0x20
>>      __handle_domain_irq+0x84/0xe0
>>      gic_handle_irq+0x80/0x108
>>      el0_irq_naked+0x50/0x58
>>
>> Therefore,I think it is necessary to report CPU utilization during the
>> softlockup_thresh period (report once every sample_period, for a total
>> of 5 reportings), like this:
>>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>>    CPU#28 Utilization every 4s during lockup:
>>      #1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>    ...
>>
>> This would be helpful in determining whether an interrupt storm has
>> occurred or in identifying the cause of the softlockup. The criteria for
>> determination are as follows:
>>    a. If the hardirq utilization is high, then interrupt storm should be
>>    considered and the root cause cannot be determined from the call tree.
>>    b. If the softirq utilization is high, then we could analyze the call
>> tree, but it may not reflect the root cause.
>>    c. If the system utilization is high, then we could analyze the root
>>    cause from the call tree.
>>
>> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
>> ---
>>   kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 58 insertions(+)
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 81a8862295d6..9fad10e0a147 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -23,6 +23,8 @@
>>   #include <linux/sched/debug.h>
>>   #include <linux/sched/isolation.h>
>>   #include <linux/stop_machine.h>
>> +#include <linux/kernel_stat.h>
>> +#include <linux/math64.h>
>>   #include <asm/irq_regs.h>
>>   #include <linux/kvm_para.h>
>> @@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
>>       return 0;
>>   }
>> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> +static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
>> +static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
>> +static DEFINE_PER_CPU(int, cpustat_tail);
>> +
>> +static void update_cpustat(void)
>> +{
>> +    u64 *old = this_cpu_ptr(cpustat_old);
>> +    u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
>> +    int tail = this_cpu_read(cpustat_tail), i;
>> +    struct kernel_cpustat kcpustat;
>> +    u64 *cpustat = kcpustat.cpustat;
>> +
>> +    kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
>> +    for (i = 0; i < NR_STATS; i++) {
>> +        diff[tail][i] = cpustat[i] - old[i];
>> +        old[i] = cpustat[i];
>> +    }
>> +    this_cpu_write(cpustat_tail, (tail + 1) % 5);
> The number 5 here is related to the 5 in cpustat_diff[5], and it is 
> recommended to use a macro definition instead of using the number 5 
> directly.
In the "set_sample_period" function, the "sample_period" is hardcoded to 
be 1/5 of the "softlockup_thresh", therefore I define the length of the 
array here as 5.
>> +}
>> +
>> +static void print_cpustat(void)
>> +{
>> +    int i, j, k;
>> +    u64 a[5][NR_STATS], b[5][NR_STATS];
> Use a #define instead of the literal number 5.
Same as above
>> +    u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
>> +    int tail = this_cpu_read(cpustat_tail);
>> +    u32 period_us = sample_period / 1000;
> Use NSEC_PER_USEC
Sure.
>> +
>> +    for (i = 0; i < 5; i++) {
>> +        for (j = 0; j < NR_STATS; j++) {
>> +            a[i][j] = 100 * (diff[i][j] / 1000);
>> +            b[i][j] = 10 * do_div(a[i][j], period_us);
>> +            do_div(b[i][j], period_us);
>> +        }
>> +    }
>> +    printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
> better use "pr_crit", and was the original intent here microseconds (us) 
> or milliseconds (ms)?
Using "pr_crit", each line will have a "watchdog:" prefix, which I think 
might not look very neat.
The intended unit here is seconds(s).

>> +        smp_processor_id(), period_us/1000000);
> better use "period_us /NSEC_PER_MSEC"?
Sure, I will use USEC_PER_SEC here.
>> +    for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
> 
> It seems that only i and j are necessary, k is not essential.
Sure, I will remove the variable 'k'.
> 
>> +        printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% 
>> softirq,\t"
>> +            "%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
>> +            a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
>> +            a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
>> +            a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
>> +            a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);
>> +    }
>> +}
>> +#else
>> +static inline void update_cpustat(void) { }
>> +static inline void print_cpustat(void) { }
>> +#endif
>> +
>>   /* watchdog detector functions */
>>   static DEFINE_PER_CPU(struct completion, softlockup_completion);
>>   static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);
>> @@ -504,6 +558,9 @@ static enum hrtimer_restart 
>> watchdog_timer_fn(struct hrtimer *hrtimer)
>>        */
>>       period_ts = READ_ONCE(*this_cpu_ptr(&watchdog_report_ts));
>> +    /* update cpu usage stat */
> The function name already indicates that it updates the CPU stats, so the 
> comment here appears superfluous.
> If a comment is absolutely necessary, please provide more detailed 
> information.
Sure.
>> +    update_cpustat();
>> +
>>       /* Reset the interval when touched by known problematic code. */
>>       if (period_ts == SOFTLOCKUP_DELAY_REPORT) {
>>           if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
>> @@ -539,6 +596,7 @@ static enum hrtimer_restart 
>> watchdog_timer_fn(struct hrtimer *hrtimer)
>>           pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
>>               smp_processor_id(), duration,
>>               current->comm, task_pid_nr(current));
>> +        print_cpustat();
>>           print_modules();
>>           print_irqtrace_events(current);
>>           if (regs)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
  2024-01-24  1:43   ` Liu Song
@ 2024-01-25  0:19   ` Doug Anderson
  2024-01-25  8:31     ` Bitao Hu
  2024-01-26  5:25   ` kernel test robot
  2 siblings, 1 reply; 18+ messages in thread
From: Doug Anderson @ 2024-01-25  0:19 UTC (permalink / raw)
  To: Bitao Hu; +Cc: akpm, pmladek, tglx, maz, liusong, linux-kernel

Hi,

On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@linux.alibaba.com> wrote:
>
> The following softlockup is caused by interrupt storm, but it cannot be
> identified from the call tree. Because the call tree is just a snapshot
> and doesn't fully capture the behavior of the CPU during the soft lockup.
>   watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>   ...
>   Call trace:
>     __do_softirq+0xa0/0x37c
>     __irq_exit_rcu+0x108/0x140
>     irq_exit+0x14/0x20
>     __handle_domain_irq+0x84/0xe0
>     gic_handle_irq+0x80/0x108
>     el0_irq_naked+0x50/0x58
>
> Therefore, I think it is necessary to report CPU utilization during the
> softlockup_thresh period (report once every sample_period, for a total
> of 5 reportings), like this:
>   watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>   CPU#28 Utilization every 4s during lockup:
>     #1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>     #2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>     #3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>     #4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>     #5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>   ...
>
> This would be helpful in determining whether an interrupt storm has
> occurred or in identifying the cause of the softlockup. The criteria for
> determination are as follows:
>   a. If the hardirq utilization is high, then interrupt storm should be
>   considered and the root cause cannot be determined from the call tree.
>   b. If the softirq utilization is high, then we could analyze the call
>   tree, but it may not reflect the root cause.
>   c. If the system utilization is high, then we could analyze the root
>   cause from the call tree.

Just to set the tone: this sounds like a great idea! I've found myself
wanting something like this for a long time but I've never sat down to
try to code up a mechanism. I have review comments below, but mostly
it's just details. IMO the idea is definitely something we want to
land.


> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
> ---
>  kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 81a8862295d6..9fad10e0a147 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -23,6 +23,8 @@
>  #include <linux/sched/debug.h>
>  #include <linux/sched/isolation.h>
>  #include <linux/stop_machine.h>
> +#include <linux/kernel_stat.h>
> +#include <linux/math64.h>
>
>  #include <asm/irq_regs.h>
>  #include <linux/kvm_para.h>
> @@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
>         return 0;
>  }
>
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
> +static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
> +static DEFINE_PER_CPU(int, cpustat_tail);

The above is potentially a lot of memory. On some configs we allow up
to 8192 CPUs and you're storing (NR_STATS * 6) u64s per CPU plus
another int per CPU. NR_STATS can be up to 11 currently, so if I did
my math right then that's

8 * 8192 * 11 * 6 + 4 * 8192 = 4,358,144 bytes (!) That's probably not OK.

In theory you only need 4 stats (plus the total) instead of 11. If you
do that plus the "old" then that would give you (8 * 8192 * 5 * 6 + 4
* 8192) = 1,998,848 bytes. That's still a lot, but at least less.
...so at the very least you should only store the fields you need.

Given the amount of potential space at stake, it's probably worth
optimizing this to something smaller than u64 as well. It feels like
it would be easy to use u32. You only need diffs here and it doesn't
seem like it would be hard to figure out how to use 32-bits, even if
you have to drop a bit of granularity.

It feels like you could even go down to 8-bits per stat, though. You
don't _really_ need such accurate percentages, right? Does it really
matter if you have 99.4%? It seems like just saying 99% would be fine.
Just do the math here and store the integral percentage per stat you
care about. The numbers 0 - 100 can be stored in 8 bits.

Combining that all together, I guess you could do this (untested):

1. Get the current stats and right shift them all by 10 and convert
them to 32-bit. This gives you (roughly) microseconds which will roll
over roughly every ~4000 seconds. That seems about right, though I
guess it's possible to do more and see if we can squeeze into 16 bits.

2. Sum up all 11 of the stats to get a total. This should still fit
into 32-bits.

3. Do the math to get the integral percentages and store those in the tail slot.

4. Store the 4 stats you care about plus the total (all 32-bits) in
"cpustat_old".

If I've got that right, your definitions should be able to be:

#define NUM_OLD_STATS_GROUPS  5
#define NUM_STATS_PER_GROUP 4
static DEFINE_PER_CPU(u32, cpustat_old[NUM_STATS_PER_GROUP + 1]);
static DEFINE_PER_CPU(u8,
cpustat_utilization[NUM_OLD_STATS_GROUPS][NUM_STATS_PER_GROUP]);
static DEFINE_PER_CPU(u8, cpustat_tail);

With the maximum number of CPUs, that's now this, if I got my math right.

4 * 8192 * 5 + 1 * 8192 * 5 * 4 + 1 * 8192 = 335,872 bytes.

That's a lot less, but still a lot. I'd be interested to hear other
opinions, but it's probably worth a Kconfig knob.


> +static void update_cpustat(void)
> +{
> +       u64 *old = this_cpu_ptr(cpustat_old);
> +       u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
> +       int tail = this_cpu_read(cpustat_tail), i;

nit: Please define "i" on its own line. It looks weird the way you have it here.

> +       struct kernel_cpustat kcpustat;
> +       u64 *cpustat = kcpustat.cpustat;
> +
> +       kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
> +       for (i = 0; i < NR_STATS; i++) {
> +               diff[tail][i] = cpustat[i] - old[i];
> +               old[i] = cpustat[i];
> +       }
> +       this_cpu_write(cpustat_tail, (tail + 1) % 5);
> +}
> +
> +static void print_cpustat(void)
> +{
> +       int i, j, k;
> +       u64 a[5][NR_STATS], b[5][NR_STATS];
> +       u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
> +       int tail = this_cpu_read(cpustat_tail);
> +       u32 period_us = sample_period / 1000;
> +
> +       for (i = 0; i < 5; i++) {
> +               for (j = 0; j < NR_STATS; j++) {
> +                       a[i][j] = 100 * (diff[i][j] / 1000);
> +                       b[i][j] = 10 * do_div(a[i][j], period_us);
> +                       do_div(b[i][j], period_us);
> +               }
> +       }
> +       printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
> +               smp_processor_id(), period_us/1000000);
> +       for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
> +               printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
> +                       "%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
> +                       a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
> +                       a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
> +                       a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
> +                       a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);

As per your response to Liu Song, I understand why you're using
"printk" instead of pr_crit(), but it deserves a comment.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-01-23 12:12 ` [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq Bitao Hu
@ 2024-01-25  0:19   ` Doug Anderson
  2024-01-25  7:50     ` Bitao Hu
  2024-01-25  9:19     ` Bitao Hu
  2024-02-14 23:36   ` Thomas Gleixner
  1 sibling, 2 replies; 18+ messages in thread
From: Doug Anderson @ 2024-01-25  0:19 UTC (permalink / raw)
  To: Bitao Hu; +Cc: akpm, pmladek, tglx, maz, liusong, linux-kernel

Hi,

On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@linux.alibaba.com> wrote:
>
> When the watchdog determines that the current soft lockup is due
> to an interrupt storm based on CPU utilization, reporting the
> top three most time-consuming hardirq can be quite useful for
> further troubleshooting.
> Below is an example of interrupt storm. The call tree does not
> provide useful information, but we can analyze which interrupt
> caused the soft lockup by using the time-consuming information
> of hardirq.
>
> [   67.714044] watchdog: BUG: soft lockup - CPU#9 stuck for 28s! [swapper/9:0]
> [   67.714548] CPU#9 Utilization every 4s during lockup:
> [   67.714549]  #1: 0.0% system,        0.0% softirq,   95.0% hardirq,  0.0% idle
> [   67.714551]  #2: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
> [   67.714553]  #3: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
> [   67.714555]  #4: 0.0% system,        0.0% softirq,   95.0% hardirq,  0.0% idle
> [   67.714556]  #5: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
> [   67.714570] CPU#9 Detect HardIRQ Time exceeds 50% since 45s. Most time-consuming HardIRQs:
> [   67.714571]  #1: 99.9% irq#7(IPI)
> [   67.714575]  #2: 0.0% irq#10(arch_timer)
> [   67.714578]  #3: 0.0% irq#2(IPI)
> ...
> [   67.714654] Call trace:
> [   67.714656]  __do_softirq+0xa8/0x364
>
> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
> ---
>  include/linux/irq.h     |   9 ++
>  include/linux/irqdesc.h |   2 +
>  kernel/irq/irqdesc.c    |   9 +-
>  kernel/watchdog.c       | 186 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 205 insertions(+), 1 deletion(-)

Just like on patch #1, I'd overall like to say that I'm very excited
about this patch and think it'll help track down a lot of issues, so
please continue moving forward with it!

That being said, my gut says that this can be a lot simpler / lower
overhead if we just use the existing IRQ counts that the framework is
already keeping track of for us. We could look for the IRQs that had
the most counts on the soft locked CPU over the period of time.

It might not be quite as good/accurate compared to what you have here
but it seems like it would be "good enough". Normally when there's an
IRQ storm that's so bad that it's causing a softlockup then each IRQ
isn't actually that slow but they're going off nonstop. It would catch
that case perfectly.

I guess maybe (?) the case it wouldn't catch so well would be if you
had a loop that looked like:

* CPU from "storming" device takes ~10 ms and finishes.
* After the "storming" device finishes, we somehow manage to service a
whole pile of normal interrupts from non-storming devices.
* After the non-storming interrupts finished, then we went back to
servicing the slow storming interrupt.

I haven't dug into the bowels of the Linux IRQ handling well enough to
know if that would be possible. However, even if it was possible it
doesn't feel likely to happen. If those "normal" interrupts aren't
storming then it seems unlikely we'd see more than one of each of them
between the "storming" interrupts. Thus it feels like the "storming"
interrupt would still have the highest (or tied for the highest) count
on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
the top 3) and you'd be pretty certain to catch even this case?

In any case, I'll wait before doing a more thorough review for now and
hope the above sounds right to you.

-Doug

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-01-25  0:19   ` Doug Anderson
@ 2024-01-25  7:50     ` Bitao Hu
  2024-01-25  9:19     ` Bitao Hu
  1 sibling, 0 replies; 18+ messages in thread
From: Bitao Hu @ 2024-01-25  7:50 UTC (permalink / raw)
  To: Doug Anderson; +Cc: akpm, pmladek, tglx, maz, liusong, linux-kernel, yaoma



On 2024/1/25 08:19, Doug Anderson wrote:
> Hi,
> 
> On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@linux.alibaba.com> wrote:
>>
>> When the watchdog determines that the current soft lockup is due
>> to an interrupt storm based on CPU utilization, reporting the
>> top three most time-consuming hardirq can be quite useful for
>> further troubleshooting.
>> Below is an example of interrupt storm. The call tree does not
>> provide useful information, but we can analyze which interrupt
>> caused the soft lockup by using the time-consuming information
>> of hardirq.
>>
>> [   67.714044] watchdog: BUG: soft lockup - CPU#9 stuck for 28s! [swapper/9:0]
>> [   67.714548] CPU#9 Utilization every 4s during lockup:
>> [   67.714549]  #1: 0.0% system,        0.0% softirq,   95.0% hardirq,  0.0% idle
>> [   67.714551]  #2: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
>> [   67.714553]  #3: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
>> [   67.714555]  #4: 0.0% system,        0.0% softirq,   95.0% hardirq,  0.0% idle
>> [   67.714556]  #5: 0.0% system,        0.0% softirq,   90.0% hardirq,  0.0% idle
>> [   67.714570] CPU#9 Detect HardIRQ Time exceeds 50% since 45s. Most time-consuming HardIRQs:
>> [   67.714571]  #1: 99.9% irq#7(IPI)
>> [   67.714575]  #2: 0.0% irq#10(arch_timer)
>> [   67.714578]  #3: 0.0% irq#2(IPI)
>> ...
>> [   67.714654] Call trace:
>> [   67.714656]  __do_softirq+0xa8/0x364
>>
>> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
>> ---
>>   include/linux/irq.h     |   9 ++
>>   include/linux/irqdesc.h |   2 +
>>   kernel/irq/irqdesc.c    |   9 +-
>>   kernel/watchdog.c       | 186 ++++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 205 insertions(+), 1 deletion(-)
> 
> Just like on patch #1, I'd overall like to say that I'm very excited
> about this patch and think it'll help track down a lot of issues, so
> please continue moving forward with it!
Sure, I will continue moving forward with it.
> 
> That being said, my gut says that this can be a lot simpler / lower
> overhead if we just use the existing IRQ counts that the framework is
> already keeping track of for us. We could look for the IRQs that had
> the most counts on the soft locked CPU over the period of time.
> 
> It might not be quite as good/accurate compared to what you have here
> but it seems like it would be "good enough". Normally when there's an
> IRQ storm that's so bad that it's causing a softlockup then each IRQ
> isn't actually that slow but they're going off nonstop. It would catch
> that case perfectly.
> 
> I guess maybe (?) the case it wouldn't catch so well would be if you
> had a loop that looked like:
> 
> * CPU from "storming" device takes ~10 ms and finishes.
> * After the "storming" device finishes, we somehow manage to service a
> whole pile of normal interrupts from non-storming devices.
> * After the non-storming interrupts finished, then we went back to
> servicing the slow storming interrupt.
I think there are two issues with using IRQ counts to detect an 
interrupt storm:

1. We might not know the normal frequency of each interrupt, which 
makes raw IRQ counts inconvenient to compare.

For instance, in your example, suppose the slow interrupt takes 10ms at 
a frequency of 100Hz and the normal interrupt takes 10us at a frequency 
of 500Hz. If we use IRQ counts, we might incorrectly conclude that the 
normal interrupt is causing the softlockup.

2. During an interrupt storm, the IRQ count recorded by the kernel does 
not necessarily increase.
The softlockup caused by an NVMe interrupt storm that I encountered is a 
case in point. Around the time of the softlockup, the NVMe interrupt 
frequency actually decreased from ~300 kHz to ~17 kHz, while the average 
time taken per NVMe interrupt increased from ~2us to ~60us. This is 
because the nvme_irq handler is capable of polling and processing 
multiple interrupts reported by the NVMe controller, and those polled 
interrupts are not captured in the kernel's interrupt statistics. The 
frequency of interrupts actually handled by the NVMe driver is ~660 kHz.

Using interrupt time as the criterion for an interrupt storm avoids 
both of these issues.

> 
> I haven't dug into the bowels of the Linux IRQ handling well enough to
> know if that would be possible. However, even if it was possible it
> doesn't feel likely to happen. If those "normal" interrupts aren't
> storming then it seems unlikely we'd see more than one of each of them
> between the "storming" interrupts. Thus it feels like the "storming"
> interrupt would still have the highest (or tied for the highest) count
> on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
> the top 3) and you'd be pretty certain to catch even this case?
> 
> In any case, I'll wait before doing a more thorough review for now and
> hope the above sounds right to you.
> 
> -Doug

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-25  0:19   ` Doug Anderson
@ 2024-01-25  8:31     ` Bitao Hu
  2024-01-25 15:08       ` Doug Anderson
  0 siblings, 1 reply; 18+ messages in thread
From: Bitao Hu @ 2024-01-25  8:31 UTC (permalink / raw)
  To: Doug Anderson; +Cc: akpm, pmladek, tglx, maz, liusong, linux-kernel, yaoma



On 2024/1/25 08:19, Doug Anderson wrote:
> Hi,
> 
> On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@linux.alibaba.com> wrote:
>>
>> The following softlockup is caused by interrupt storm, but it cannot be
>> identified from the call tree. Because the call tree is just a snapshot
>> and doesn't fully capture the behavior of the CPU during the soft lockup.
>>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>>    ...
>>    Call trace:
>>      __do_softirq+0xa0/0x37c
>>      __irq_exit_rcu+0x108/0x140
>>      irq_exit+0x14/0x20
>>      __handle_domain_irq+0x84/0xe0
>>      gic_handle_irq+0x80/0x108
>>      el0_irq_naked+0x50/0x58
>>
>> Therefore, I think it is necessary to report CPU utilization during the
>> softlockup_thresh period (report once every sample_period, for a total
>> of 5 reportings), like this:
>>    watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>>    CPU#28 Utilization every 4s during lockup:
>>      #1: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #2: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #3: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #4: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>      #5: 0.0% system, 0.0% softirq, 100.0% hardirq, 0.0% idle
>>    ...
>>
>> This would be helpful in determining whether an interrupt storm has
>> occurred or in identifying the cause of the softlockup. The criteria for
>> determination are as follows:
>>    a. If the hardirq utilization is high, then interrupt storm should be
>>    considered and the root cause cannot be determined from the call tree.
>>    b. If the softirq utilization is high, then we could analyze the call
>>    tree, but it may not reflect the root cause.
>>    c. If the system utilization is high, then we could analyze the root
>>    cause from the call tree.
> 
> Just to set the tone: this sounds like a great idea! I've found myself
> wanting something like this for a long time but I've never sat down to
> try to code up a mechanism. I have review comments below, but mostly
> it's just details. IMO the idea is definitely something we want to
> land.
Thanks for your support of this idea.
> 
> 
>> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
>> ---
>>   kernel/watchdog.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 58 insertions(+)
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index 81a8862295d6..9fad10e0a147 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -23,6 +23,8 @@
>>   #include <linux/sched/debug.h>
>>   #include <linux/sched/isolation.h>
>>   #include <linux/stop_machine.h>
>> +#include <linux/kernel_stat.h>
>> +#include <linux/math64.h>
>>
>>   #include <asm/irq_regs.h>
>>   #include <linux/kvm_para.h>
>> @@ -441,6 +443,58 @@ static int is_softlockup(unsigned long touch_ts,
>>          return 0;
>>   }
>>
>> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> +static DEFINE_PER_CPU(u64, cpustat_old[NR_STATS]);
>> +static DEFINE_PER_CPU(u64, cpustat_diff[5][NR_STATS]);
>> +static DEFINE_PER_CPU(int, cpustat_tail);
> 
> The above is potentially a lot of memory. On some configs we allow up
> to 8192 CPUs and you're storing (NR_STATS * 6) u64s per CPU plus
> another int per CPU . NR_STATS can be up to 11 currently, so if I did
> my math right then that's
> 
> 8 * 8192 * 11 * 6 + 4 * 8192 = 4,358,144 bytes (!) That's probably not OK.
> 
> In theory you only need 4 stats (plus the total) instead of 11. If you
> do that plus the "old" then that would give you (8 * 8192 * 5 * 6 + 4
> * 8192) = 1,998,848 bytes. That's still a lot, but at least less.
> ...so at the very least you should only store the fields you need.
> 
> Given the amount of potential space at stake, it's probably worth
> optimizing this to something smaller than u64 as well. It feels like
> it would be easy to use u32. You only need diffs here and it doesn't
> seem like it would be hard to figure out how to use 32-bits, even if
> you have to drop a bit of granularity.
> 
> It feels like you could even go down to 8-bits per stat, though. You
> don't _really_ need such accurate percentages, right? Does it really
> matter if you have 99.4%? It seems like just saying 99% would be fine.
> Just do the math here and store the integral percentage per stat you
> care about. The numbers 0 - 100 can be stored in 8 bits.
> 
Thanks for your detailed analysis and optimization of memory 
consumption. I indeed hadn't considered this point, and I will
update it in v2.
> Combining that all together, I guess you could do this (untested):
> 
> 1. Get the current stats and right shift them all by 10 and convert
> them to 32-bit. This gives you (roughly) microseconds which will roll
> over roughly every ~4000 seconds. That seems about right, though I
> guess it's possible to do more and see if we can squeeze into 16 bits.
> 
> 2. Sum up all 11 of the stats to get a total. This should still fit
> into 32-bits.

> 
> 3. Do the math to get the integral percentages and store those in the tail slot.
> 
I'm a bit confused: why is there a need to sum up all 11 stats to
get a total? I calculate the percentage using sample_period.
> 4. Store the 4 stats you care about plus the total (all 32-bits) in
> "cpustat_old".
> 
> If I've got that right, your definitions should be able to be:
> 
> #define NUM_OLD_STATS_GROUPS  5
> #define NUM_STATS_PER_GROUP 4
> static DEFINE_PER_CPU(u32, cpustat_old[NUM_STATS_PER_GROUP + 1]);
> static DEFINE_PER_CPU(u8,
> cpustat_utilization[NUM_OLD_STATS_GROUPS][NUM_STATS_PER_GROUP]);
> static DEFINE_PER_CPU(u8, cpustat_tail);
> 
> With the maximum number of CPUs, that's now this, if I got my math right.
> 
> 4 * 8192 * 5 + 1 * 8192 * 5 * 4 + 1 * 8192 = 335,872 bytes.
> 
> That's a lot less, but still a lot. I'd be interested to hear other
> opinions, but it's probably worth a Kconfig knob.
> 
> 
>> +static void update_cpustat(void)
>> +{
>> +       u64 *old = this_cpu_ptr(cpustat_old);
>> +       u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
>> +       int tail = this_cpu_read(cpustat_tail), i;
> 
> nit: Please define "i" on its own line. It looks weird the way you have it here.
Sure.
> 
>> +       struct kernel_cpustat kcpustat;
>> +       u64 *cpustat = kcpustat.cpustat;
>> +
>> +       kcpustat_cpu_fetch(&kcpustat, smp_processor_id());
>> +       for (i = 0; i < NR_STATS; i++) {
>> +               diff[tail][i] = cpustat[i] - old[i];
>> +               old[i] = cpustat[i];
>> +       }
>> +       this_cpu_write(cpustat_tail, (tail + 1) % 5);
>> +}
>> +
>> +static void print_cpustat(void)
>> +{
>> +       int i, j, k;
>> +       u64 a[5][NR_STATS], b[5][NR_STATS];
>> +       u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
>> +       int tail = this_cpu_read(cpustat_tail);
>> +       u32 period_us = sample_period / 1000;
>> +
>> +       for (i = 0; i < 5; i++) {
>> +               for (j = 0; j < NR_STATS; j++) {
>> +                       a[i][j] = 100 * (diff[i][j] / 1000);
>> +                       b[i][j] = 10 * do_div(a[i][j], period_us);
>> +                       do_div(b[i][j], period_us);
>> +               }
>> +       }
>> +       printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
>> +               smp_processor_id(), period_us/1000000);
>> +       for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
>> +               printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
>> +                       "%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
>> +                       a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
>> +                       a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
>> +                       a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
>> +                       a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);
> 
> As per your response to Liu Song, I understand why you're using
> "printk" instead of pr_crit(), but it deserves a comment.
Sure, I will add a comment here.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-01-25  0:19   ` Doug Anderson
  2024-01-25  7:50     ` Bitao Hu
@ 2024-01-25  9:19     ` Bitao Hu
  1 sibling, 0 replies; 18+ messages in thread
From: Bitao Hu @ 2024-01-25  9:19 UTC (permalink / raw)
  To: Doug Anderson; +Cc: akpm, pmladek, tglx, maz, liusong, linux-kernel, yaoma



On 2024/1/25 08:19, Doug Anderson wrote:
> 
> I guess maybe (?) the case it wouldn't catch so well would be if you
> had a loop that looked like:
> 
> * CPU from "storming" device takes ~10 ms and finishes.
> * After the "storming" device finishes, we somehow manage to service a
> whole pile of normal interrupts from non-storming devices.
> * After the non-storming interrupts finished, then we went back to
> servicing the slow storming interrupt.
> 
> I haven't dug into the bowels of the Linux IRQ handling well enough to
> know if that would be possible. However, even if it was possible it
> doesn't feel likely to happen. If those "normal" interrupts aren't
> storming then it seems unlikely we'd see more than one of each of them
> between the "storming" interrupts. Thus it feels like the "storming"
> interrupt would still have the highest (or tied for the highest) count
> on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
> the top 3) and you'd be pretty certain to catch even this case?
> 
Hi, I now understand your suggestion. Using interrupt time as the
criterion for an interrupt storm suits scenarios where we don't yet
know whether a storm has happened. However, in the context of this
softlockup, we have in fact already inferred from the "hardirq time"
that an interrupt storm might have occurred, so a second determination
would be redundant. Thus, I agree that using IRQ counts seems good
enough to find the "storming" interrupts.

> In any case, I'll wait before doing a more thorough review for now and
> hope the above sounds right to you.
> 
> -Doug

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-25  8:31     ` Bitao Hu
@ 2024-01-25 15:08       ` Doug Anderson
  0 siblings, 0 replies; 18+ messages in thread
From: Doug Anderson @ 2024-01-25 15:08 UTC (permalink / raw)
  To: Bitao Hu; +Cc: akpm, pmladek, tglx, maz, liusong, linux-kernel

Hi,

On Thu, Jan 25, 2024 at 12:31 AM Bitao Hu <yaoma@linux.alibaba.com> wrote:
>
> > 2. Sum up all 11 of the stats to get a total. This should still fit
> > into 32-bits.
>
> >
> > 3. Do the math to get the integral percentages and store those in the tail slot.
> >
> I'm a bit confused: why is there a need to sum up all 11 of the stats to
> get a total? I calculate the percentage using sample_period.

Oh, you're right! I saw you looping over NR_STATS in print_cpustat()
and assumed that was because you were totaling them. Yeah, using
sample_period should work.

-Doug

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of time-consuming hardirq
  2024-01-23 12:12 ` [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of " Bitao Hu
@ 2024-01-26  5:25   ` kernel test robot
  2024-01-26  6:07   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: kernel test robot @ 2024-01-26  5:25 UTC (permalink / raw)
  To: Bitao Hu, dianders, akpm, pmladek, tglx, maz, liusong
  Cc: llvm, oe-kbuild-all, linux-kernel, Bitao Hu

Hi Bitao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/irq/core]
[also build test WARNING on akpm-mm/mm-everything linus/master v6.8-rc1 next-20240125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bitao-Hu/watchdog-softlockup-low-overhead-detection-of-interrupt-storm/20240123-201509
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20240123121223.22318-4-yaoma%40linux.alibaba.com
patch subject: [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of time-consuming hardirq
config: i386-buildonly-randconfig-003-20240126 (https://download.01.org/0day-ci/archive/20240126/202401261359.eaU4UnjQ-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240126/202401261359.eaU4UnjQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401261359.eaU4UnjQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/watchdog.c:1088:5: warning: no previous prototype for function 'proc_softlockup_irqtrace' [-Wmissing-prototypes]
    1088 | int proc_softlockup_irqtrace(struct ctl_table *table, int write,
         |     ^
   kernel/watchdog.c:1088:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    1088 | int proc_softlockup_irqtrace(struct ctl_table *table, int write,
         | ^
         | static 
   1 warning generated.


vim +/proc_softlockup_irqtrace +1088 kernel/watchdog.c

  1084	
  1085	/*
  1086	 * /proc/sys/kernel/softlockup_irqtrace
  1087	 */
> 1088	int proc_softlockup_irqtrace(struct ctl_table *table, int write,
  1089				 void *buffer, size_t *lenp, loff_t *ppos)
  1090	{
  1091		int err, old;
  1092	
  1093		mutex_lock(&watchdog_mutex);
  1094	
  1095		old = READ_ONCE(softlockup_irqtrace);
  1096		err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
  1097	
  1098		if (!err && write && old != READ_ONCE(softlockup_irqtrace))
  1099			proc_watchdog_update();
  1100	
  1101		mutex_unlock(&watchdog_mutex);
  1102		return err;
  1103	}
  1104	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
  2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
  2024-01-24  1:43   ` Liu Song
  2024-01-25  0:19   ` Doug Anderson
@ 2024-01-26  5:25   ` kernel test robot
  2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2024-01-26  5:25 UTC (permalink / raw)
  To: Bitao Hu, dianders, akpm, pmladek, tglx, maz, liusong
  Cc: oe-kbuild-all, linux-kernel, Bitao Hu

Hi Bitao,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on linus/master v6.8-rc1 next-20240125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bitao-Hu/watchdog-softlockup-low-overhead-detection-of-interrupt-storm/20240123-201509
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20240123121223.22318-2-yaoma%40linux.alibaba.com
patch subject: [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm
config: i386-randconfig-013-20240126 (https://download.01.org/0day-ci/archive/20240126/202401261310.Ixxefn7o-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240126/202401261310.Ixxefn7o-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401261310.Ixxefn7o-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: kernel/watchdog.o: in function `print_cpustat':
>> kernel/watchdog.c:453: undefined reference to `__udivdi3'
>> ld: kernel/watchdog.c:457: undefined reference to `__udivdi3'


vim +453 kernel/watchdog.c

   446	
   447	static void print_cpustat(void)
   448	{
   449		int i, j, k;
   450		u64 a[5][NR_STATS], b[5][NR_STATS];
   451		u64 (*diff)[NR_STATS] = this_cpu_ptr(cpustat_diff);
   452		int tail = this_cpu_read(cpustat_tail);
 > 453		u32 period_us = sample_period / 1000;
   454	
   455		for (i = 0; i < 5; i++) {
   456			for (j = 0; j < NR_STATS; j++) {
 > 457				a[i][j] = 100 * (diff[i][j] / 1000);
   458				b[i][j] = 10 * do_div(a[i][j], period_us);
   459				do_div(b[i][j], period_us);
   460			}
   461		}
   462		printk(KERN_CRIT "CPU#%d Utilization every %us during lockup:\n",
   463			smp_processor_id(), period_us/1000000);
   464		for (k = 0, i = tail; k < 5; k++, i = (i + 1) % 5) {
   465			printk(KERN_CRIT "\t#%d: %llu.%llu%% system,\t%llu.%llu%% softirq,\t"
   466				"%llu.%llu%% hardirq,\t%llu.%llu%% idle\n", k+1,
   467				a[i][CPUTIME_SYSTEM], b[i][CPUTIME_SYSTEM],
   468				a[i][CPUTIME_SOFTIRQ], b[i][CPUTIME_SOFTIRQ],
   469				a[i][CPUTIME_IRQ], b[i][CPUTIME_IRQ],
   470				a[i][CPUTIME_IDLE], b[i][CPUTIME_IDLE]);
   471		}
   472	}
   473	#else
   474	static inline void update_cpustat(void) { }
   475	static inline void print_cpustat(void) { }
   476	#endif
   477	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of time-consuming hardirq
  2024-01-23 12:12 ` [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of " Bitao Hu
  2024-01-26  5:25   ` kernel test robot
@ 2024-01-26  6:07   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: kernel test robot @ 2024-01-26  6:07 UTC (permalink / raw)
  To: Bitao Hu, dianders, akpm, pmladek, tglx, maz, liusong
  Cc: oe-kbuild-all, linux-kernel, Bitao Hu

Hi Bitao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/irq/core]
[also build test WARNING on akpm-mm/mm-everything linus/master v6.8-rc1 next-20240125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bitao-Hu/watchdog-softlockup-low-overhead-detection-of-interrupt-storm/20240123-201509
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20240123121223.22318-4-yaoma%40linux.alibaba.com
patch subject: [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of time-consuming hardirq
config: i386-randconfig-012-20240126 (https://download.01.org/0day-ci/archive/20240126/202401261322.fGeoPvI9-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240126/202401261322.fGeoPvI9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401261322.fGeoPvI9-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/watchdog.c:1088:5: warning: no previous prototype for 'proc_softlockup_irqtrace' [-Wmissing-prototypes]
    1088 | int proc_softlockup_irqtrace(struct ctl_table *table, int write,
         |     ^~~~~~~~~~~~~~~~~~~~~~~~


vim +/proc_softlockup_irqtrace +1088 kernel/watchdog.c

  1084	
  1085	/*
  1086	 * /proc/sys/kernel/softlockup_irqtrace
  1087	 */
> 1088	int proc_softlockup_irqtrace(struct ctl_table *table, int write,
  1089				 void *buffer, size_t *lenp, loff_t *ppos)
  1090	{
  1091		int err, old;
  1092	
  1093		mutex_lock(&watchdog_mutex);
  1094	
  1095		old = READ_ONCE(softlockup_irqtrace);
  1096		err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
  1097	
  1098		if (!err && write && old != READ_ONCE(softlockup_irqtrace))
  1099			proc_watchdog_update();
  1100	
  1101		mutex_unlock(&watchdog_mutex);
  1102		return err;
  1103	}
  1104	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-01-23 12:12 ` [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq Bitao Hu
  2024-01-25  0:19   ` Doug Anderson
@ 2024-02-14 23:36   ` Thomas Gleixner
  2024-02-14 23:39     ` Doug Anderson
  1 sibling, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2024-02-14 23:36 UTC (permalink / raw)
  To: Bitao Hu, dianders, akpm, pmladek, maz, liusong; +Cc: linux-kernel, Bitao Hu

On Tue, Jan 23 2024 at 20:12, Bitao Hu wrote:
> +/*
> + * If the proportion of time spent handling irq exceeds 50% during a sampling period,
> + * then it is necessary to tally the handling time of each irq.
> + */
> +static inline bool need_trace_irqtime(int type)
> +{
> +	int tail = this_cpu_read(cpustat_tail);
> +
> +	if (--tail == -1)
> +		tail = 4;
> +	return this_cpu_read(cpustat_diff[tail][type]) > sample_period/2;
> +}
> +
> +static void irq_handler_entry_callback(void *data, int irq, struct irqaction *action)
> +{
> +	this_cpu_ptr(irq_to_desc(irq)->irq_time)->start_time = local_clock();
> +}
> +
> +static void irq_handler_exit_callback(void *data, int irq, struct irqaction *action, int ret)
> +{
> +	u64 delta;
> +	struct per_irqtime *irq_time;
> +
> +	if (test_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags))) {
> +		irq_time = this_cpu_ptr(irq_to_desc(irq)->irq_time);
> +		delta = local_clock() - irq_time->start_time;
> +		irq_time->delta += delta;
> +	}
> +}

I can understand what you are trying to achieve, but you inflict the
overhead of two tracepoints to every interrupt unconditionally.

For the vast majority of usage scenarios that's just pointless overhead
for no value. That's not acceptable at all.

Instrumentation and debugging has to find the least intrusive solution
and not just go for the perfect picture. Remember: perfect is the enemy
of good.

You really have to think hard about what you really need to achieve for
a particular problem case.

In this case it's completely sufficient to know the number of interrupts
which happened since softlockup detection took a snapshot and the actual
analysis.

That's especially true when interrupt time accounting is active because
then the only interesting information is which interrupts fired during
the observation period.

Even if that's not available it is a reasonable assumption that the
number of interrupts during the observation period gives a pretty
conclusive hint about the potential cause of the problem.

This is not meant to find the oddball issue of an interrupt handler
which consumes a lot of time per invocation. This is about storm
detection where the handler runs briefly and actually returns
IRQ_HANDLED so that the spurious detection does not catch it.

This can be implemented with exactly zero overhead for the good case,
keeping it self contained to the interrupt core code and providing
sensible interfaces for the watchdog code.

See the uncompiled patch for illustration below.

As a side note: While C does not allow proper encapsulation it's a non
starter to fiddle with the interrupt descriptor internals in random code
just because the compiler allows you to do so. While not enforced there
are clear boundaries and we went a long way to encapsulate this.

Thanks,

        tglx
---
 include/linux/irqdesc.h |    9 +++++++--
 kernel/irq/internals.h  |    2 +-
 kernel/irq/irqdesc.c    |   32 ++++++++++++++++++++++++++------
 kernel/irq/proc.c       |    9 +++------
 4 files changed, 37 insertions(+), 15 deletions(-)

--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -17,6 +17,11 @@ struct irq_desc;
 struct irq_domain;
 struct pt_regs;
 
+struct irqstat {
+	unsigned int	cnt;
+	unsigned int	ref;
+};
+
 /**
  * struct irq_desc - interrupt descriptor
  * @irq_common_data:	per irq and chip data passed down to chip functions
@@ -55,7 +60,7 @@ struct pt_regs;
 struct irq_desc {
 	struct irq_common_data	irq_common_data;
 	struct irq_data		irq_data;
-	unsigned int __percpu	*kstat_irqs;
+	struct irqstat __percpu	*kstat_irqs;
 	irq_flow_handler_t	handle_irq;
 	struct irqaction	*action;	/* IRQ action list */
 	unsigned int		status_use_accessors;
@@ -119,7 +124,7 @@ extern struct irq_desc irq_desc[NR_IRQS]
 static inline unsigned int irq_desc_kstat_cpu(struct irq_desc *desc,
 					      unsigned int cpu)
 {
-	return desc->kstat_irqs ? *per_cpu_ptr(desc->kstat_irqs, cpu) : 0;
+	return desc->kstat_irqs ? per_cpu(desc->kstat_irqs->cnt, cpu) : 0;
 }
 
 static inline struct irq_desc *irq_data_to_desc(struct irq_data *data)
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -258,7 +258,7 @@ static inline void irq_state_set_masked(
 
 static inline void __kstat_incr_irqs_this_cpu(struct irq_desc *desc)
 {
-	__this_cpu_inc(*desc->kstat_irqs);
+	__this_cpu_inc(desc->kstat_irqs->cnt);
 	__this_cpu_inc(kstat.irqs_sum);
 }
 
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -122,7 +122,7 @@ static void desc_set_defaults(unsigned i
 	desc->name = NULL;
 	desc->owner = owner;
 	for_each_possible_cpu(cpu)
-		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
+		*per_cpu_ptr(desc->kstat_irqs, cpu) = (struct irqstat) { };
 	desc_smp_init(desc, node, affinity);
 }
 
@@ -418,8 +418,8 @@ static struct irq_desc *alloc_desc(int i
 	desc = kzalloc_node(sizeof(*desc), GFP_KERNEL, node);
 	if (!desc)
 		return NULL;
-	/* allocate based on nr_cpu_ids */
-	desc->kstat_irqs = alloc_percpu(unsigned int);
+
+	desc->kstat_irqs = alloc_percpu(struct irqstat);
 	if (!desc->kstat_irqs)
 		goto err_desc;
 
@@ -952,8 +952,7 @@ unsigned int kstat_irqs_cpu(unsigned int
 {
 	struct irq_desc *desc = irq_to_desc(irq);
 
-	return desc && desc->kstat_irqs ?
-			*per_cpu_ptr(desc->kstat_irqs, cpu) : 0;
+	return desc && desc->kstat_irqs ? per_cpu(desc->kstat_irqs->cnt, cpu) : 0;
 }
 
 static bool irq_is_nmi(struct irq_desc *desc)
@@ -975,10 +974,31 @@ static unsigned int kstat_irqs(unsigned
 		return data_race(desc->tot_count);
 
 	for_each_possible_cpu(cpu)
-		sum += data_race(*per_cpu_ptr(desc->kstat_irqs, cpu));
+		sum += data_race(per_cpu(desc->kstat_irqs->cnt, cpu));
 	return sum;
 }
 
+void kstat_snapshot_irqs(void)
+{
+	struct irq_desc *desc;
+	unsigned int irq;
+
+	for_each_irq_desc(irq, desc) {
+		if (!desc->kstat_irqs)
+			continue;
+		this_cpu_write(desc->kstat_irqs->ref, this_cpu_read(desc->kstat_irqs->cnt));
+	}
+}
+
+unsigned int kstat_get_irq_since_snapshot(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	if (!desc || !desc->kstat_irqs)
+		return 0;
+	return this_cpu_read(desc->kstat_irqs->cnt) - this_cpu_read(desc->kstat_irqs->ref);
+}
+
 /**
  * kstat_irqs_usr - Get the statistics for an interrupt from thread context
  * @irq:	The interrupt number
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -488,18 +488,15 @@ int show_interrupts(struct seq_file *p,
 	if (!desc || irq_settings_is_hidden(desc))
 		goto outsparse;
 
-	if (desc->kstat_irqs) {
-		for_each_online_cpu(j)
-			any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, j));
-	}
+	if (desc->kstat_irqs)
+		any_count = data_race(desc->tot_count);
 
 	if ((!desc->action || irq_desc_is_chained(desc)) && !any_count)
 		goto outsparse;
 
 	seq_printf(p, "%*d: ", prec, i);
 	for_each_online_cpu(j)
-		seq_printf(p, "%10u ", desc->kstat_irqs ?
-					*per_cpu_ptr(desc->kstat_irqs, j) : 0);
+		seq_printf(p, "%10u ", desc->kstat_irqs ? per_cpu(desc->kstat_irqs->cnt, j) : 0);
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	if (desc->irq_data.chip) {

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-02-14 23:36   ` Thomas Gleixner
@ 2024-02-14 23:39     ` Doug Anderson
  2024-02-15  1:02       ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Doug Anderson @ 2024-02-14 23:39 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Bitao Hu, akpm, pmladek, maz, liusong, linux-kernel

Hi,

On Wed, Feb 14, 2024 at 3:36 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Jan 23 2024 at 20:12, Bitao Hu wrote:
> > +/*
> > + * If the proportion of time spent handling irq exceeds 50% during a sampling period,
> > + * then it is necessary to tally the handling time of each irq.
> > + */
> > +static inline bool need_trace_irqtime(int type)
> > +{
> > +     int tail = this_cpu_read(cpustat_tail);
> > +
> > +     if (--tail == -1)
> > +             tail = 4;
> > +     return this_cpu_read(cpustat_diff[tail][type]) > sample_period/2;
> > +}
> > +
> > +static void irq_handler_entry_callback(void *data, int irq, struct irqaction *action)
> > +{
> > +     this_cpu_ptr(irq_to_desc(irq)->irq_time)->start_time = local_clock();
> > +}
> > +
> > +static void irq_handler_exit_callback(void *data, int irq, struct irqaction *action, int ret)
> > +{
> > +     u64 delta;
> > +     struct per_irqtime *irq_time;
> > +
> > +     if (test_bit(SOFTLOCKUP_HARDIRQ, this_cpu_ptr(&softlockup_flags))) {
> > +             irq_time = this_cpu_ptr(irq_to_desc(irq)->irq_time);
> > +             delta = local_clock() - irq_time->start_time;
> > +             irq_time->delta += delta;
> > +     }
> > +}
>
> I can understand what you are trying to achieve, but you inflict the
> overhead of two tracepoints to every interrupt unconditionally.
>
> For the vast majority of usage scenarios that's just pointless overhead
> for no value. That's not acceptable at all.
>
> Instrumentation and debugging has to find the least intrusive solution
> and not just go for the perfect picture. Remember: perfect is the enemy
> of good.
>
> You really have to think hard about what you really need to achieve for
> a particular problem case.
>
> In this case it's completely sufficient to know the number of interrupts
> which happened since softlockup detection took a snapshot and the actual
> analysis.
>
> That's especially true when interrupt time accounting is active because
> then the only interesting information is which interrupts fired during
> the observation period.
>
> Even if that's not available it is a reasonable assumption that the
> number of interrupts during the observation period gives a pretty
> conclusive hint about the potential cause of the problem.
>
> This is not meant to find the oddball issue of an interrupt handler
> which consumes a lot of time per invocation. This is about storm
> detection where the handler runs briefly and actually returns
> IRQ_HANDLED so that the spurious detection does not catch it.
>
> This can be implemented with exactly zero overhead for the good case,
> keeping it self contained to the interrupt core code and providing
> sensible interfaces for the watchdog code.
>
> See the uncompiled patch for illustration below.
>
> As a side note: While C does not allow proper encapsulation it's a non
> starter to fiddle with the interrupt descriptor internals in random code
> just because the compiler allows you to do so. While not enforced there
> are clear boundaries and we went a long way to encapsulate this.

I think you must have gotten dropped from all the future versions of
this patch series when Bitao took my advice and started using
interrupt counts instead of tracing. For what it's worth, the latest
version can be found at:

https://lore.kernel.org/r/20240214021430.87471-1-yaoma@linux.alibaba.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq
  2024-02-14 23:39     ` Doug Anderson
@ 2024-02-15  1:02       ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2024-02-15  1:02 UTC (permalink / raw)
  To: Doug Anderson; +Cc: Bitao Hu, akpm, pmladek, maz, liusong, linux-kernel

On Wed, Feb 14 2024 at 15:39, Doug Anderson wrote:
> On Wed, Feb 14, 2024 at 3:36 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Jan 23 2024 at 20:12, Bitao Hu wrote:
>> As a side note: While C does not allow proper encapsulation it's a non
>> starter to fiddle with the interrupt descriptor internals in random code
>> just because the compiler allows you to do so. While not enforced there
>> are clear boundaries and we went a long way to encapsulate this.
>
> I think you must have gotten dropped from all the future versions of
> this patch series when Bitao took my advice and started using
> interrupt counts instead of tracing. For what it's worth, the latest
> version can be found at:
>
> https://lore.kernel.org/r/20240214021430.87471-1-yaoma@linux.alibaba.com

Yes. I'm not on CC and for some stupid reason it evaded my LKML filters.

Let me find that thread and stare at it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-02-15  1:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-23 12:12 [PATCH 0/3] *** Detect interrupt storm in softlockup *** Bitao Hu
2024-01-23 12:12 ` [PATCH 1/3] watchdog/softlockup: low-overhead detection of interrupt storm Bitao Hu
2024-01-24  1:43   ` Liu Song
2024-01-24  2:48     ` yaoma
2024-01-25  0:19   ` Doug Anderson
2024-01-25  8:31     ` Bitao Hu
2024-01-25 15:08       ` Doug Anderson
2024-01-26  5:25   ` kernel test robot
2024-01-23 12:12 ` [PATCH 2/3] watchdog/softlockup: report the most time-consuming hardirq Bitao Hu
2024-01-25  0:19   ` Doug Anderson
2024-01-25  7:50     ` Bitao Hu
2024-01-25  9:19     ` Bitao Hu
2024-02-14 23:36   ` Thomas Gleixner
2024-02-14 23:39     ` Doug Anderson
2024-02-15  1:02       ` Thomas Gleixner
2024-01-23 12:12 ` [PATCH 3/3] watchdog/softlockup: add parameter to control the reporting of " Bitao Hu
2024-01-26  5:25   ` kernel test robot
2024-01-26  6:07   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox