public inbox for linux-kernel@vger.kernel.org
* [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring
@ 2023-05-11 18:19 Tejun Heo
  2023-05-11 18:19 ` [PATCH 1/7] workqueue: Add pwq->stats[] and a monitoring script Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team

Hello,

v3: * Switched to hooking into scheduler_tick() instead of the scheduling
      paths, as suggested by Peter. It's less gnarly and works well in
      general; however, as the mechanism is now sampling-based, there can be
      contrived cases where detection can be temporarily avoided. Also, it
      wouldn't work on nohz_full CPUs. Neither is critical, especially given
      that common offenders are likely to be weeded out with the debug
      reporting over time.

    * As the above means that workqueue is no longer observing all
      scheduling events, it can't track the CPU time being consumed by the
      workers on its own and thus can't use global clocks (e.g. jiffies).
      The CPU time consumption tracking is still done with
      p->se.sum_exec_runtime.

    * The mechanism was incorrectly monitoring the entire CPU time a given
      work item has consumed instead of each stretch without intervening
      sleeps. Fixed.

    * CPU time monitoring is now tick sampling based. The previous
      p->se.sum_exec_runtime implementation was missing CPU time consumed
      between the work item's last scheduling event and its completion,
      so, e.g., work items that never schedule would always be
      accounted as zero CPU time. While the sampling based implementation
      isn't very accurate, this is good enough for getting the overall
      picture of workqueues that consume a lot of CPU cycles.

    * Patches reordered so that the visibility one can be applied first.
      Documentation improved.

v2: * Lai pointed out that !SM_NONE cases should also be notified. 0001 and
      0004 are updated accordingly.

    * PeterZ suggested reporting on work items that trigger the auto CPU
      intensive mechanism. 0006 adds reporting of work functions that
      trigger the mechanism repeatedly with exponential backoff.


To reduce the number of concurrent worker threads, workqueue holds back
starting per-cpu work items while the previous work item stays in the
RUNNING state. As such, a per-cpu work item that consumes a lot of CPU
cycles, even if it has cond_resched() in the right places, can stall other
per-cpu work items.
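
The hold-back behavior described above can be illustrated with a toy model.
All names here are invented for illustration; this is not the kernel's
implementation, just a sketch of the concurrency-management rule.

```python
# Toy model of per-CPU workqueue concurrency management: pending work items
# are held back while the current item stays RUNNING. Names are hypothetical.
from collections import deque

class ToyPool:
    """One per-CPU worker pool running one concurrency-managed item at a time."""
    def __init__(self):
        self.pending = deque()
        self.running = None

    def queue_work(self, item):
        self.pending.append(item)
        self._maybe_start_next()

    def _maybe_start_next(self):
        # Pending items are held back while the current item stays RUNNING.
        if self.running is None and self.pending:
            self.running = self.pending.popleft()

    def current_sleeps_or_finishes(self):
        # Only a sleep/completion of the RUNNING item lets the next one start,
        # which is why a CPU-hogging item stalls everything queued behind it.
        self.running = None
        self._maybe_start_next()

pool = ToyPool()
pool.queue_work("cpu-hog")   # starts immediately
pool.queue_work("victim")    # held back behind "cpu-hog"
assert pool.running == "cpu-hog" and "victim" in pool.pending
```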

To support per-cpu work items that may occupy the CPU for a substantial
period of time, workqueue has the WQ_CPU_INTENSIVE flag, which exempts work
items issued through the marked workqueue from concurrency management - they're
started immediately and don't block other work items. While this works, it's
error-prone in that a workqueue user can easily forget to set the flag or
set it unnecessarily. Furthermore, the impact of a wrong flag setting can
be rather indirect and challenging to root-cause.

This patchset makes workqueue auto-detect CPU intensive work items based on
CPU consumption. If a work item consumes more than the threshold (10ms by
default) of CPU time, it's automatically marked as CPU intensive when it
gets scheduled out, which unblocks the starting of pending per-cpu work
items.
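
The marking and its unblocking effect can be sketched as follows. The
threshold constant and all names are hypothetical stand-ins, not the
kernel's actual symbols; the point is the accounting rule, not the code.

```python
# Sketch of threshold-based CPU-intensive marking. The accounting mirrors
# the idea of snapshotting accumulated CPU time (like p->se.sum_exec_runtime)
# when a work item starts; names are invented for illustration.
CPU_INTENSIVE_THRESH_NS = 10_000_000  # 10ms default threshold

class ToyWorker:
    def __init__(self):
        self.sum_exec_runtime = 0   # total CPU time consumed by this worker
        self.current_at = 0         # runtime snapshot when the item started
        self.cpu_intensive = False

    def start_work(self):
        self.current_at = self.sum_exec_runtime
        self.cpu_intensive = False

    def account(self, delta_ns):
        """Charge CPU time; past the threshold the item is marked CPU
        intensive, exempting it from concurrency management so pending
        per-CPU work items can start."""
        self.sum_exec_runtime += delta_ns
        stretch = self.sum_exec_runtime - self.current_at
        if not self.cpu_intensive and stretch >= CPU_INTENSIVE_THRESH_NS:
            self.cpu_intensive = True
        return self.cpu_intensive

w = ToyWorker()
w.start_work()
assert w.account(4_000_000) is False   # 4ms: still concurrency-managed
assert w.account(7_000_000) is True    # 11ms total: marked CPU intensive
```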

The mechanism isn't foolproof in that the detection delays can add up if
many CPU-hogging work items are queued at the same time. However, in such
situations, the bigger problem likely is the CPU being saturated with
per-cpu work items and the solution would be making them UNBOUND. Future
changes will make UNBOUND workqueues more attractive by improving their
locality behaviors and configurability. We might eventually remove the
explicit WQ_CPU_INTENSIVE flag.

While at it, add statistics and a monitoring script. Lack of visibility has
always been a pain point when debugging workqueue-related issues, and
with this change and more drastic ones planned for workqueue, this is a good
time to address the shortcoming.

This patchset was born out of the discussion in the following thread:

 https://lkml.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com

and contains the following patches:

 0001-workqueue-Add-pwq-stats-and-a-monitoring-script.patch
 0002-workqueue-Re-order-struct-worker-fields.patch
 0003-workqueue-Move-worker_set-clr_flags-upwards.patch
 0004-workqueue-Improve-locking-rule-description-for-worke.patch
 0005-workqueue-Automatically-mark-CPU-hogging-work-items-.patch
 0006-workqueue-Report-work-funcs-that-trigger-automatic-C.patch
 0007-workqueue-Track-and-monitor-per-workqueue-CPU-time-u.patch

and also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git auto-cpu-intensive-v3

diffstat follows. Thanks.

 Documentation/core-api/workqueue.rst |   32 +++++
 kernel/sched/core.c                  |    3
 kernel/workqueue.c                   |  337 ++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 kernel/workqueue_internal.h          |   24 ++--
 lib/Kconfig.debug                    |   13 ++
 tools/workqueue/wq_monitor.py        |  169 ++++++++++++++++++++++++++++++
 6 files changed, 507 insertions(+), 71 deletions(-)

--
tejun


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/7] workqueue: Add pwq->stats[] and a monitoring script
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 18:19 ` [PATCH 2/7] workqueue: Re-order struct worker fields Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.

It's arguable whether this needs to be configurable. However, it currently
has only a handful of counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's which are per-cpu on per-cpu workqueues and
per-numa-node on unbound ones. Let's keep it simple for the time being.

v2: Patch reordered earlier with fewer fields. Fields will be added back
    gradually. Help message improved.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
---
 Documentation/core-api/workqueue.rst |  32 ++++++
 kernel/workqueue.c                   |  24 ++++-
 tools/workqueue/wq_monitor.py        | 151 +++++++++++++++++++++++++++
 3 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 tools/workqueue/wq_monitor.py

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 8ec4d6270b24..7e5c39310bbf 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -348,6 +348,37 @@ Guidelines
   level of locality in wq operations and work item execution.
 
 
+Monitoring
+==========
+
+Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
+
+  $ tools/workqueue/wq_monitor.py events
+                              total  infl  CMwake  mayday rescued
+  events                      18545     0       5       -       -
+  events_highpri                  8     0       0       -       -
+  events_long                     3     0       0       -       -
+  events_unbound              38306     0       -       -       -
+  events_freezable                0     0       0       -       -
+  events_power_efficient      29598     0       0       -       -
+  events_freezable_power_        10     0       0       -       -
+  sock_diag_events                0     0       0       -       -
+
+                              total  infl  CMwake  mayday rescued
+  events                      18548     0       5       -       -
+  events_highpri                  8     0       0       -       -
+  events_long                     3     0       0       -       -
+  events_unbound              38322     0       -       -       -
+  events_freezable                0     0       0       -       -
+  events_power_efficient      29603     0       0       -       -
+  events_freezable_power_        10     0       0       -       -
+  sock_diag_events                0     0       0       -       -
+
+  ...
+
+See the command's help message for more info.
+
+
 Debugging
 =========
 
@@ -387,6 +418,7 @@ For the second type of problems it should be possible to just check
 The work item's function should be trivially visible in the stack
 trace.
 
+
 Non-reentrance Conditions
 =========================
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4666a1a92a31..33ddfaacdf29 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -199,6 +199,20 @@ struct worker_pool {
 	struct rcu_head		rcu;
 };
 
+/*
+ * Per-pool_workqueue statistics. These can be monitored using
+ * tools/workqueue/wq_monitor.py.
+ */
+enum pool_workqueue_stats {
+	PWQ_STAT_STARTED,	/* work items started execution */
+	PWQ_STAT_COMPLETED,	/* work items completed execution */
+	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
+	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
+	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */
+
+	PWQ_NR_STATS,
+};
+
 /*
  * The per-pool workqueue.  While queued, the lower WORK_STRUCT_FLAG_BITS
  * of work_struct->data are used for flags and the remaining high bits
@@ -236,6 +250,8 @@ struct pool_workqueue {
 	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
 	struct list_head	mayday_node;	/* MD: node on wq->maydays */
 
+	u64			stats[PWQ_NR_STATS];
+
 	/*
 	 * Release of unbound pwq is punted to system_wq.  See put_pwq()
 	 * and pwq_unbound_release_workfn() for details.  pool_workqueue
@@ -929,8 +945,10 @@ void wq_worker_sleeping(struct task_struct *task)
 	}
 
 	pool->nr_running--;
-	if (need_more_worker(pool))
+	if (need_more_worker(pool)) {
+		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
 		wake_up_worker(pool);
+	}
 	raw_spin_unlock_irq(&pool->lock);
 }
 
@@ -2163,6 +2181,7 @@ static void send_mayday(struct work_struct *work)
 		get_pwq(pwq);
 		list_add_tail(&pwq->mayday_node, &wq->maydays);
 		wake_up_process(wq->rescuer->task);
+		pwq->stats[PWQ_STAT_MAYDAY]++;
 	}
 }
 
@@ -2401,6 +2420,7 @@ __acquires(&pool->lock)
 	 * workqueues), so hiding them isn't a problem.
 	 */
 	lockdep_invariant_state(true);
+	pwq->stats[PWQ_STAT_STARTED]++;
 	trace_workqueue_execute_start(work);
 	worker->current_func(work);
 	/*
@@ -2408,6 +2428,7 @@ __acquires(&pool->lock)
 	 * point will only record its address.
 	 */
 	trace_workqueue_execute_end(work, worker->current_func);
+	pwq->stats[PWQ_STAT_COMPLETED]++;
 	lock_map_release(&lockdep_map);
 	lock_map_release(&pwq->wq->lockdep_map);
 
@@ -2651,6 +2672,7 @@ static int rescuer_thread(void *__rescuer)
 				if (first)
 					pool->watchdog_ts = jiffies;
 				move_linked_works(work, scheduled, &n);
+				pwq->stats[PWQ_STAT_RESCUED]++;
 			}
 			first = false;
 		}
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
new file mode 100644
index 000000000000..94435fd14c98
--- /dev/null
+++ b/tools/workqueue/wq_monitor.py
@@ -0,0 +1,151 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2023 Tejun Heo <tj@kernel.org>
+# Copyright (C) 2023 Meta Platforms, Inc. and affiliates.
+
+desc = """
+
+This is a drgn script to monitor workqueues. For more info on drgn, visit
+https://github.com/osandov/drgn.
+
+  total    Total number of work items executed by the workqueue.
+
+  infl     The number of currently in-flight work items.
+
+  CMwake   The number of concurrency-management wake-ups while executing a
+           work item of the workqueue.
+
+  mayday   The number of times the rescuer was requested while waiting for
+           new worker creation.
+
+  rescued  The number of work items executed by the rescuer.
+"""
+
+import sys
+import signal
+import os
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry,list_empty
+from drgn.helpers.linux.cpumask import for_each_possible_cpu
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('workqueue', metavar='REGEX', nargs='*',
+                    help='Target workqueue name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+                    help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+                    help='Output in json')
+args = parser.parse_args()
+
+def err(s):
+    print(s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+workqueues              = prog['workqueues']
+
+WQ_UNBOUND              = prog['WQ_UNBOUND']
+WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
+
+PWQ_STAT_STARTED        = prog['PWQ_STAT_STARTED']      # work items started execution
+PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']	# work items completed execution
+PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']    # concurrency-management worker wakeups
+PWQ_STAT_MAYDAY         = prog['PWQ_STAT_MAYDAY']	# maydays to rescuer
+PWQ_STAT_RESCUED        = prog['PWQ_STAT_RESCUED']	# linked work items executed by rescuer
+PWQ_NR_STATS            = prog['PWQ_NR_STATS']
+
+class WqStats:
+    def __init__(self, wq):
+        self.name = wq.name.string_().decode()
+        self.unbound = wq.flags & WQ_UNBOUND != 0
+        self.mem_reclaim = wq.flags & WQ_MEM_RECLAIM != 0
+        self.stats = [0] * PWQ_NR_STATS
+        for pwq in list_for_each_entry('struct pool_workqueue', wq.pwqs.address_of_(), 'pwqs_node'):
+            for i in range(PWQ_NR_STATS):
+                self.stats[i] += int(pwq.stats[i])
+
+    def dict(self, now):
+        return { 'timestamp'            : now,
+                 'name'                 : self.name,
+                 'unbound'              : self.unbound,
+                 'mem_reclaim'          : self.mem_reclaim,
+                 'started'              : self.stats[PWQ_STAT_STARTED],
+                 'completed'            : self.stats[PWQ_STAT_COMPLETED],
+                 'cm_wakeup'            : self.stats[PWQ_STAT_CM_WAKEUP],
+                 'mayday'               : self.stats[PWQ_STAT_MAYDAY],
+                 'rescued'              : self.stats[PWQ_STAT_RESCUED], }
+
+    def table_header_str():
+        return f'{"":>24} {"total":>8} {"infl":>5} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+
+    def table_row_str(self):
+        cm_wakeup = '-'
+        mayday = '-'
+        rescued = '-'
+
+        if not self.unbound:
+            cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+
+        if self.mem_reclaim:
+            mayday = str(self.stats[PWQ_STAT_MAYDAY])
+            rescued = str(self.stats[PWQ_STAT_RESCUED])
+
+        out = f'{self.name[-24:]:24} ' \
+              f'{self.stats[PWQ_STAT_STARTED]:8} ' \
+              f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
+              f'{cm_wakeup:>7} ' \
+              f'{mayday:>7} ' \
+              f'{rescued:>7} '
+        return out.rstrip(':')
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+    global exit_req
+    exit_req = True
+
+def main():
+    # handle args
+    table_fmt = not args.json
+    interval = args.interval
+
+    re_str = None
+    if args.workqueue:
+        for r in args.workqueue:
+            if re_str is None:
+                re_str = r
+            else:
+                re_str += '|' + r
+
+    filter_re = re.compile(re_str) if re_str else None
+
+    # monitoring loop
+    signal.signal(signal.SIGINT, sigint_handler)
+
+    while not exit_req:
+        now = time.time()
+
+        if table_fmt:
+            print()
+            print(WqStats.table_header_str())
+
+        for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+            stats = WqStats(wq)
+            if filter_re and not filter_re.search(stats.name):
+                continue
+            if table_fmt:
+                print(stats.table_row_str())
+            else:
+                print(stats.dict(now))
+
+        if interval == 0:
+            break
+        time.sleep(interval)
+
+if __name__ == "__main__":
+    main()
-- 
2.40.1



* [PATCH 2/7] workqueue: Re-order struct worker fields
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
  2023-05-11 18:19 ` [PATCH 1/7] workqueue: Add pwq->stats[] and a monitoring script Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 18:19 ` [PATCH 3/7] workqueue: Move worker_set/clr_flags() upwards Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

struct worker was laid out with the intent that all fields that are modified
for each work item execution are in the first cacheline. However, this
hasn't been true for a while with the addition of ->last_func. Let's just
collect hot fields together at the top.

Move ->sleeping into the hole after ->current_color and move ->last_func right
below. While at it, drop the cacheline comment which isn't useful anymore.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
---
 kernel/workqueue_internal.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e00b1204a8e9..0600f04ceeb2 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -32,9 +32,12 @@ struct worker {
 	work_func_t		current_func;	/* L: current_work's fn */
 	struct pool_workqueue	*current_pwq;	/* L: current_work's pwq */
 	unsigned int		current_color;	/* L: current_work's color */
-	struct list_head	scheduled;	/* L: scheduled works */
+	int			sleeping;	/* None */
+
+	/* used by the scheduler to determine a worker's last known identity */
+	work_func_t		last_func;	/* L: last work's fn */
 
-	/* 64 bytes boundary on 64bit, 32 on 32bit */
+	struct list_head	scheduled;	/* L: scheduled works */
 
 	struct task_struct	*task;		/* I: worker task */
 	struct worker_pool	*pool;		/* A: the associated pool */
@@ -45,7 +48,6 @@ struct worker {
 	unsigned long		last_active;	/* L: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
-	int			sleeping;	/* None */
 
 	/*
 	 * Opaque string set with work_set_desc().  Printed out with task
@@ -55,9 +57,6 @@ struct worker {
 
 	/* used only by rescuers to point to the target workqueue */
 	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
-
-	/* used by the scheduler to determine a worker's last known identity */
-	work_func_t		last_func;
 };
 
 /**
-- 
2.40.1



* [PATCH 3/7] workqueue: Move worker_set/clr_flags() upwards
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
  2023-05-11 18:19 ` [PATCH 1/7] workqueue: Add pwq->stats[] and a monitoring script Tejun Heo
  2023-05-11 18:19 ` [PATCH 2/7] workqueue: Re-order struct worker fields Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 18:19 ` [PATCH 4/7] workqueue: Improve locking rule description for worker fields Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

They are going to be used in wq_worker_stopping(). Move them upwards.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
---
 kernel/workqueue.c | 108 ++++++++++++++++++++++-----------------------
 1 file changed, 54 insertions(+), 54 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 33ddfaacdf29..452bcb49192a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -879,6 +879,60 @@ static void wake_up_worker(struct worker_pool *pool)
 		wake_up_process(worker->task);
 }
 
+/**
+ * worker_set_flags - set worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to set
+ *
+ * Set @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_set_flags(struct worker *worker, unsigned int flags)
+{
+	struct worker_pool *pool = worker->pool;
+
+	WARN_ON_ONCE(worker->task != current);
+
+	/* If transitioning into NOT_RUNNING, adjust nr_running. */
+	if ((flags & WORKER_NOT_RUNNING) &&
+	    !(worker->flags & WORKER_NOT_RUNNING)) {
+		pool->nr_running--;
+	}
+
+	worker->flags |= flags;
+}
+
+/**
+ * worker_clr_flags - clear worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to clear
+ *
+ * Clear @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
+{
+	struct worker_pool *pool = worker->pool;
+	unsigned int oflags = worker->flags;
+
+	WARN_ON_ONCE(worker->task != current);
+
+	worker->flags &= ~flags;
+
+	/*
+	 * If transitioning out of NOT_RUNNING, increment nr_running.  Note
+	 * that the nested NOT_RUNNING is not a noop.  NOT_RUNNING is mask
+	 * of multiple flags, not a single flag.
+	 */
+	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
+		if (!(worker->flags & WORKER_NOT_RUNNING))
+			pool->nr_running++;
+}
+
 /**
  * wq_worker_running - a worker is running again
  * @task: task waking up
@@ -983,60 +1037,6 @@ work_func_t wq_worker_last_func(struct task_struct *task)
 	return worker->last_func;
 }
 
-/**
- * worker_set_flags - set worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to set
- *
- * Set @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_set_flags(struct worker *worker, unsigned int flags)
-{
-	struct worker_pool *pool = worker->pool;
-
-	WARN_ON_ONCE(worker->task != current);
-
-	/* If transitioning into NOT_RUNNING, adjust nr_running. */
-	if ((flags & WORKER_NOT_RUNNING) &&
-	    !(worker->flags & WORKER_NOT_RUNNING)) {
-		pool->nr_running--;
-	}
-
-	worker->flags |= flags;
-}
-
-/**
- * worker_clr_flags - clear worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to clear
- *
- * Clear @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
-{
-	struct worker_pool *pool = worker->pool;
-	unsigned int oflags = worker->flags;
-
-	WARN_ON_ONCE(worker->task != current);
-
-	worker->flags &= ~flags;
-
-	/*
-	 * If transitioning out of NOT_RUNNING, increment nr_running.  Note
-	 * that the nested NOT_RUNNING is not a noop.  NOT_RUNNING is mask
-	 * of multiple flags, not a single flag.
-	 */
-	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
-		if (!(worker->flags & WORKER_NOT_RUNNING))
-			pool->nr_running++;
-}
-
 /**
  * find_worker_executing_work - find worker which is executing a work
  * @pool: pool of interest
-- 
2.40.1



* [PATCH 4/7] workqueue: Improve locking rule description for worker fields
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
                   ` (2 preceding siblings ...)
  2023-05-11 18:19 ` [PATCH 3/7] workqueue: Move worker_set/clr_flags() upwards Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 18:19 ` [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

* Some worker fields are modified only by the worker itself while holding
  pool->lock, making them safe to read from the worker itself, from IRQ
  context if the CPU is running the worker, or while holding pool->lock.
  Add the 'K' locking rule for them.

* worker->sleeping is currently marked "None" which isn't very descriptive.
  It's used only by the worker itself. Add 'S' locking rule for it.

A future patch will depend on the 'K' rule to access worker->current_* from
the scheduler ticks.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c          |  6 ++++++
 kernel/workqueue_internal.h | 15 ++++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 452bcb49192a..d6bfac871965 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -126,6 +126,12 @@ enum {
  *    cpu or grabbing pool->lock is enough for read access.  If
  *    POOL_DISASSOCIATED is set, it's identical to L.
  *
+ * K: Only modified by worker while holding pool->lock. Can be safely read by
+ *    self, while holding pool->lock or from IRQ context if %current is the
+ *    kworker.
+ *
+ * S: Only modified by worker self.
+ *
  * A: wq_pool_attach_mutex protected.
  *
  * PL: wq_pool_mutex protected.
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index 0600f04ceeb2..c2455be7b4c2 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -28,14 +28,15 @@ struct worker {
 		struct hlist_node	hentry;	/* L: while busy */
 	};
 
-	struct work_struct	*current_work;	/* L: work being processed */
-	work_func_t		current_func;	/* L: current_work's fn */
-	struct pool_workqueue	*current_pwq;	/* L: current_work's pwq */
-	unsigned int		current_color;	/* L: current_work's color */
-	int			sleeping;	/* None */
+	struct work_struct	*current_work;	/* K: work being processed and its */
+	work_func_t		current_func;	/* K: function */
+	struct pool_workqueue	*current_pwq;	/* K: pwq */
+	unsigned int		current_color;	/* K: color */
+
+	int			sleeping;	/* S: is worker sleeping? */
 
 	/* used by the scheduler to determine a worker's last known identity */
-	work_func_t		last_func;	/* L: last work's fn */
+	work_func_t		last_func;	/* K: last work's fn */
 
 	struct list_head	scheduled;	/* L: scheduled works */
 
@@ -45,7 +46,7 @@ struct worker {
 	struct list_head	node;		/* A: anchored at pool->workers */
 						/* A: runs through worker->node */
 
-	unsigned long		last_active;	/* L: last active timestamp */
+	unsigned long		last_active;	/* K: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
 
-- 
2.40.1



* [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
                   ` (3 preceding siblings ...)
  2023-05-11 18:19 ` [PATCH 4/7] workqueue: Improve locking rule description for worker fields Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 21:23   ` Peter Zijlstra
  2023-05-11 18:19 ` [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism Tejun Heo
  2023-05-11 18:19 ` [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage Tejun Heo
  6 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.

This patch adds an automatic CPU usage based detection. If a
concurrency-managed work item consumes more CPU time than the threshold
(10ms by default) continuously without intervening sleeps, wq_worker_tick(),
which is called from scheduler_tick(), will detect the condition and
automatically mark it CPU_INTENSIVE.

The mechanism isn't foolproof:

* Detection depends on the tick hitting the work item. Getting preempted at
  the right times may allow a violating work item to evade detection, at
  least temporarily.

* nohz_full CPUs may not be running ticks and thus can fail detection.

* Even when detection is working, the 10ms detection delays can add up if
  many CPU-hogging work items are queued at the same time.

However, in the vast majority of cases, this should be able to detect violations
reliably and provide reasonable protection with a small increase in code
complexity.

If some work items trigger this condition repeatedly, the bigger problem
likely is the CPU being saturated with such per-cpu work items and the
solution would be making them UNBOUND. The next patch will add a debug
mechanism to help spot such cases.
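
The "continuously without intervening sleeps" requirement above is the key
property of the tick-based sampling: a sleep resets the measured stretch.
A minimal sketch, with hypothetical names (the kernel does this via
wq_worker_tick() and p->se.sum_exec_runtime, not this exact logic):

```python
# Toy model of per-stretch, tick-sampled detection. Only *continuous* CPU
# consumption past the threshold triggers marking; names are invented.
THRESH_NS = 10_000_000  # 10ms default

class TickSampler:
    def __init__(self):
        self.stretch_ns = 0        # CPU time consumed since the last sleep
        self.cpu_intensive = False

    def on_tick(self, ran_ns):
        # Called from the scheduler tick while the work item is on-CPU.
        self.stretch_ns += ran_ns
        if self.stretch_ns >= THRESH_NS:
            self.cpu_intensive = True

    def on_sleep(self):
        # An intervening sleep resets the stretch, so a work item that
        # sleeps regularly never trips the detector.
        self.stretch_ns = 0

s = TickSampler()
for _ in range(9):
    s.on_tick(1_000_000)        # 9ms continuous: below threshold
s.on_sleep()                    # sleep resets the stretch
s.on_tick(1_000_000)
assert s.cpu_intensive is False # 10ms total, but not continuous

s2 = TickSampler()
for _ in range(10):
    s2.on_tick(1_000_000)       # 10ms continuous: detected
assert s2.cpu_intensive is True
```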

v3: Switched to using wq_worker_tick() instead of hooking into the
    preemption paths, as suggested by Peter.

v2: Lai pointed out that wq_worker_stopping() also needs to be called from
    preemption and rtlock paths and an earlier patch was updated
    accordingly. This patch adds a comment describing the risk of infinite
    recursions and how they're avoided.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
---
 Documentation/core-api/workqueue.rst | 38 ++++++++--------
 kernel/sched/core.c                  |  3 ++
 kernel/workqueue.c                   | 68 ++++++++++++++++++++++++++--
 kernel/workqueue_internal.h          |  2 +
 tools/workqueue/wq_monitor.py        | 13 +++++-
 5 files changed, 99 insertions(+), 25 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 7e5c39310bbf..a389f31b025c 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -354,25 +354,25 @@ Monitoring
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
 
   $ tools/workqueue/wq_monitor.py events
-                              total  infl  CMwake  mayday rescued
-  events                      18545     0       5       -       -
-  events_highpri                  8     0       0       -       -
-  events_long                     3     0       0       -       -
-  events_unbound              38306     0       -       -       -
-  events_freezable                0     0       0       -       -
-  events_power_efficient      29598     0       0       -       -
-  events_freezable_power_        10     0       0       -       -
-  sock_diag_events                0     0       0       -       -
-
-                              total  infl  CMwake  mayday rescued
-  events                      18548     0       5       -       -
-  events_highpri                  8     0       0       -       -
-  events_long                     3     0       0       -       -
-  events_unbound              38322     0       -       -       -
-  events_freezable                0     0       0       -       -
-  events_power_efficient      29603     0       0       -       -
-  events_freezable_power_        10     0       0       -       -
-  sock_diag_events                0     0       0       -       -
+                              total  infl CPUitsv  CMwake  mayday rescued
+  events                      18545     0       0       5       -       -
+  events_highpri                  8     0       0       0       -       -
+  events_long                     3     0       0       0       -       -
+  events_unbound              38306     0       -       -       -       -
+  events_freezable                0     0       0       0       -       -
+  events_power_efficient      29598     0       0       0       -       -
+  events_freezable_power_        10     0       0       0       -       -
+  sock_diag_events                0     0       0       0       -       -
+
+                              total  infl CPUitsv  CMwake  mayday rescued
+  events                      18548     0       0       5       -       -
+  events_highpri                  8     0       0       0       -       -
+  events_long                     3     0       0       0       -       -
+  events_unbound              38322     0       -       -       -       -
+  events_freezable                0     0       0       0       -       -
+  events_power_efficient      29603     0       0       0       -       -
+  events_freezable_power_        10     0       0       0       -       -
+  sock_diag_events                0     0       0       0       -       -
 
   ...
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..3484cada9a4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5632,6 +5632,9 @@ void scheduler_tick(void)
 
 	perf_event_task_tick();
 
+	if (curr->flags & PF_WQ_WORKER)
+		wq_worker_tick(curr);
+
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
 	trigger_load_balance(rq);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d6bfac871965..661acc5afcfd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -212,6 +212,7 @@ struct worker_pool {
 enum pool_workqueue_stats {
 	PWQ_STAT_STARTED,	/* work items started execution */
 	PWQ_STAT_COMPLETED,	/* work items completed execution */
+	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
 	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
 	PWQ_STAT_RESCUED,	/* linked work items executed by rescuer */
@@ -332,6 +333,14 @@ static struct kmem_cache *pwq_cache;
 static cpumask_var_t *wq_numa_possible_cpumask;
 					/* possible CPUs of each node */
 
+/*
+ * Per-cpu work items which run for longer than the following threshold are
+ * automatically considered CPU intensive and excluded from concurrency
+ * management to prevent them from noticeably delaying other per-cpu work items.
+ */
+static unsigned long wq_cpu_intensive_thresh_us = 10000;
+module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
+
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
 
@@ -962,6 +971,13 @@ void wq_worker_running(struct task_struct *task)
 	if (!(worker->flags & WORKER_NOT_RUNNING))
 		worker->pool->nr_running++;
 	preempt_enable();
+
+	/*
+	 * CPU intensive auto-detection cares about how long a work item hogged
+	 * CPU without sleeping. Reset the starting timestamp on wakeup.
+	 */
+	worker->current_at = worker->task->se.sum_exec_runtime;
+
 	worker->sleeping = 0;
 }
 
@@ -1012,6 +1028,45 @@ void wq_worker_sleeping(struct task_struct *task)
 	raw_spin_unlock_irq(&pool->lock);
 }
 
+/**
+ * wq_worker_tick - a scheduler tick occurred while a kworker is running
+ * @task: task currently running
+ *
+ * Called from scheduler_tick(). We're in the IRQ context and the current
+ * worker's fields which follow the 'K' locking rule can be accessed safely.
+ */
+void wq_worker_tick(struct task_struct *task)
+{
+	struct worker *worker = kthread_data(task);
+	struct pool_workqueue *pwq = worker->current_pwq;
+	struct worker_pool *pool = worker->pool;
+
+	if (!pwq)
+		return;
+
+	/*
+	 * If the current worker is concurrency managed and hogged the CPU for
+	 * longer than wq_cpu_intensive_thresh_us, it's automatically marked
+	 * CPU_INTENSIVE to avoid stalling other concurrency-managed work items.
+	 */
+	if ((worker->flags & WORKER_NOT_RUNNING) ||
+	    worker->task->se.sum_exec_runtime - worker->current_at <
+	    wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
+		return;
+
+	raw_spin_lock(&pool->lock);
+
+	worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+	pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
+
+	if (need_more_worker(pool)) {
+		pwq->stats[PWQ_STAT_CM_WAKEUP]++;
+		wake_up_worker(pool);
+	}
+
+	raw_spin_unlock(&pool->lock);
+}
+
 /**
  * wq_worker_last_func - retrieve worker's last work function
  * @task: Task to retrieve last work function of.
@@ -2325,7 +2380,6 @@ __acquires(&pool->lock)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
 	unsigned long work_data;
 	struct worker *collision;
 #ifdef CONFIG_LOCKDEP
@@ -2362,6 +2416,7 @@ __acquires(&pool->lock)
 	worker->current_work = work;
 	worker->current_func = work->func;
 	worker->current_pwq = pwq;
+	worker->current_at = worker->task->se.sum_exec_runtime;
 	work_data = *work_data_bits(work);
 	worker->current_color = get_work_color(work_data);
 
@@ -2379,7 +2434,7 @@ __acquires(&pool->lock)
 	 * of concurrency management and the next code block will chain
 	 * execution of the pending work items.
 	 */
-	if (unlikely(cpu_intensive))
+	if (unlikely(pwq->wq->flags & WQ_CPU_INTENSIVE))
 		worker_set_flags(worker, WORKER_CPU_INTENSIVE);
 
 	/*
@@ -2459,9 +2514,12 @@ __acquires(&pool->lock)
 
 	raw_spin_lock_irq(&pool->lock);
 
-	/* clear cpu intensive status */
-	if (unlikely(cpu_intensive))
-		worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
+	/*
+	 * In addition to %WQ_CPU_INTENSIVE, @worker may also have been marked
+	 * CPU intensive by wq_worker_tick() if @work hogged CPU longer than
+	 * wq_cpu_intensive_thresh_us. Clear it.
+	 */
+	worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
 	/* tag the worker for identification in schedule() */
 	worker->last_func = worker->current_func;
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index c2455be7b4c2..6b1d66e28269 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -31,6 +31,7 @@ struct worker {
 	struct work_struct	*current_work;	/* K: work being processed and its */
 	work_func_t		current_func;	/* K: function */
 	struct pool_workqueue	*current_pwq;	/* K: pwq */
+	u64			current_at;	/* K: runtime at start or last wakeup */
 	unsigned int		current_color;	/* K: color */
 
 	int			sleeping;	/* S: is worker sleeping? */
@@ -76,6 +77,7 @@ static inline struct worker *current_wq_worker(void)
  */
 void wq_worker_running(struct task_struct *task);
 void wq_worker_sleeping(struct task_struct *task);
+void wq_worker_tick(struct task_struct *task);
 work_func_t wq_worker_last_func(struct task_struct *task);
 
 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
index 94435fd14c98..5dc0b61b36e8 100644
--- a/tools/workqueue/wq_monitor.py
+++ b/tools/workqueue/wq_monitor.py
@@ -12,6 +12,11 @@ https://github.com/osandov/drgn.
 
   infl     The number of currently in-flight work items.
 
+  CPUitsv  The number of times a concurrency-managed work item hogged CPU
+           longer than the threshold (workqueue.cpu_intensive_thresh_us)
+           and got excluded from concurrency management to avoid stalling
+           other work items.
+
   CMwake   The number of concurrency-management wake-ups while executing a
            work item of the workqueue.
 
@@ -54,6 +59,7 @@ WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
 
 PWQ_STAT_STARTED        = prog['PWQ_STAT_STARTED']      # work items started execution
 PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']	# work items completed execution
+PWQ_STAT_CPU_INTENSIVE  = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us violations
 PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']    # concurrency-management worker wakeups
 PWQ_STAT_MAYDAY         = prog['PWQ_STAT_MAYDAY']	# maydays to rescuer
 PWQ_STAT_RESCUED        = prog['PWQ_STAT_RESCUED']	# linked work items executed by rescuer
@@ -76,19 +82,23 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
                  'mem_reclaim'          : self.mem_reclaim,
                  'started'              : self.stats[PWQ_STAT_STARTED],
                  'completed'            : self.stats[PWQ_STAT_COMPLETED],
+                 'cpu_intensive'        : self.stats[PWQ_STAT_CPU_INTENSIVE],
                  'cm_wakeup'            : self.stats[PWQ_STAT_CM_WAKEUP],
                  'mayday'               : self.stats[PWQ_STAT_MAYDAY],
                  'rescued'              : self.stats[PWQ_STAT_RESCUED], }
 
     def table_header_str():
-        return f'{"":>24} {"total":>8} {"infl":>5} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUitsv":>7} '\
+            f'{"CMwake":>7} {"mayday":>7} {"rescued":>7}'
 
     def table_row_str(self):
+        cpu_intensive = '-'
         cm_wakeup = '-'
         mayday = '-'
         rescued = '-'
 
         if not self.unbound:
+            cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
             cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
 
         if self.mem_reclaim:
@@ -98,6 +108,7 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
         out = f'{self.name[-24:]:24} ' \
               f'{self.stats[PWQ_STAT_STARTED]:8} ' \
               f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
+              f'{cpu_intensive:>7} ' \
               f'{cm_wakeup:>7} ' \
               f'{mayday:>7} ' \
               f'{rescued:>7} '
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
                   ` (4 preceding siblings ...)
  2023-05-11 18:19 ` [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 21:26   ` Peter Zijlstra
  2023-05-12 19:42   ` [PATCH v2 " Tejun Heo
  2023-05-11 18:19 ` [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage Tejun Heo
  6 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

Workqueue now automatically marks per-cpu work items that hog CPU for too
long as CPU_INTENSIVE, which excludes them from concurrency management and
prevents stalling other concurrency-managed work items. If a work function
keeps running over the threshold, it likely needs to be switched to use an
unbound workqueue.

This patch adds a debug mechanism which tracks the work functions that
trigger the automatic CPU_INTENSIVE mechanism and reports them using
pr_warn() with exponential backoff.
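
The throttling rule can be sketched as follows (illustrative Python, not part
of the patch): a violating function is reported starting from its fourth
violation and then only when its count is a power of two, matching the
`cnt < 4 || !is_power_of_2(cnt)` test in wq_cpu_intensive_report().

```python
def should_report(cnt):
    """True if the cnt-th violation of a work function gets reported."""
    if cnt < 4:
        return False
    return cnt & (cnt - 1) == 0  # power-of-2 test, like is_power_of_2()

reported = [n for n in range(1, 100) if should_report(n)]
# reported == [4, 8, 16, 32, 64]: reporting frequency decays exponentially
```

Repeat offenders thus stay visible in the log without flooding the console.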

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/workqueue.c | 128 +++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug  |  13 +++++
 2 files changed, 141 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 661acc5afcfd..b85d70f133f7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -948,6 +948,132 @@ static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
 			pool->nr_running++;
 }
 
+#ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
+
+/*
+ * Concurrency-managed per-cpu work items that hog CPU for longer than
+ * wq_cpu_intensive_thresh_us trigger the automatic CPU_INTENSIVE mechanism,
+ * which prevents them from stalling other concurrency-managed work items. If a
+ * work function keeps triggering this mechanism, it's likely that the work item
+ * should be using an unbound workqueue instead.
+ *
+ * wq_cpu_intensive_report() tracks work functions which trigger such conditions
+ * and reports them so that they can be examined and converted to use unbound
+ * workqueues as appropriate. To avoid flooding the console, each violating work
+ * function is tracked and reported with exponential backoff.
+ */
+#define WCI_MAX_ENTS 128
+
+struct wci_ent {
+	work_func_t		func;
+	atomic64_t		cnt;
+	struct hlist_node	hash_node;
+	struct list_head	report_node;
+};
+
+static struct wci_ent wci_ents[WCI_MAX_ENTS];
+static int wci_nr_ents;
+static DEFINE_RAW_SPINLOCK(wci_lock);
+static DEFINE_HASHTABLE(wci_hash, ilog2(WCI_MAX_ENTS));
+static LIST_HEAD(wci_report_list);
+static struct kthread_worker *wci_report_worker;
+
+static void wci_report_workfn(struct kthread_work *work)
+{
+	struct wci_ent *ent, *n;
+
+	raw_spin_lock_irq(&wci_lock);
+	list_for_each_entry_safe(ent, n, &wci_report_list, report_node) {
+		pr_warn("workqueue: %ps hogged CPU for >%luus %llu times, consider switching to WQ_UNBOUND\n",
+			ent->func, wq_cpu_intensive_thresh_us,
+			atomic64_read(&ent->cnt));
+		list_del_init(&ent->report_node);
+	}
+	raw_spin_unlock_irq(&wci_lock);
+}
+
+static DEFINE_KTHREAD_WORK(wci_report_work, wci_report_workfn);
+
+static struct wci_ent *wci_find_ent(work_func_t func)
+{
+	struct wci_ent *ent;
+
+	hash_for_each_possible_rcu(wci_hash, ent, hash_node,
+				   (unsigned long)func) {
+		if (ent->func == func)
+			return ent;
+	}
+	return NULL;
+}
+
+static void wq_cpu_intensive_report(work_func_t func)
+{
+	struct wci_ent *ent;
+
+restart:
+	ent = wci_find_ent(func);
+	if (ent) {
+		u64 cnt;
+
+		/*
+		 * Start reporting from the fourth time and back off
+		 * exponentially.
+		 */
+		cnt = atomic64_inc_return_relaxed(&ent->cnt);
+		if (cnt < 4 || !is_power_of_2(cnt))
+			return;
+
+		raw_spin_lock(&wci_lock);
+		if (list_empty(&ent->report_node)) {
+			list_add_tail(&ent->report_node, &wci_report_list);
+			if (wci_report_worker)
+				kthread_queue_work(wci_report_worker,
+						   &wci_report_work);
+		}
+		raw_spin_unlock(&wci_lock);
+		return;
+	}
+
+	/*
+	 * @func is a new violation. Allocate a new entry for it. If wci_ents[]
+	 * is exhausted, something went really wrong and we probably made enough
+	 * noise already.
+	 */
+	if (wci_nr_ents >= WCI_MAX_ENTS)
+		return;
+
+	raw_spin_lock(&wci_lock);
+
+	if (wci_nr_ents >= WCI_MAX_ENTS) {
+		raw_spin_unlock(&wci_lock);
+		return;
+	}
+
+	if (wci_find_ent(func)) {
+		raw_spin_unlock(&wci_lock);
+		goto restart;
+	}
+
+	ent = &wci_ents[wci_nr_ents++];
+	ent->func = func;
+	atomic64_set(&ent->cnt, 1);
+	INIT_LIST_HEAD(&ent->report_node);
+	hash_add_rcu(wci_hash, &ent->hash_node, (unsigned long)func);
+
+	raw_spin_unlock(&wci_lock);
+}
+
+static void __init wq_cpu_intensive_report_init(void)
+{
+	wci_report_worker = kthread_create_worker(0, "wq_cpu_intensive_report");
+	WARN_ON(!wci_report_worker);
+}
+
+#else	/* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+static void wq_cpu_intensive_report(work_func_t func) {}
+static void __init wq_cpu_intensive_report_init(void) {}
+#endif	/* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+
 /**
  * wq_worker_running - a worker is running again
  * @task: task waking up
@@ -1057,6 +1183,7 @@ void wq_worker_tick(struct task_struct *task)
 	raw_spin_lock(&pool->lock);
 
 	worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+	wq_cpu_intensive_report(worker->current_func);
 	pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
 
 	if (need_more_worker(pool)) {
@@ -6458,6 +6585,7 @@ void __init workqueue_init(void)
 
 	wq_online = true;
 	wq_watchdog_init();
+	wq_cpu_intensive_report_init();
 }
 
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc6803..97e880aa48d7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1134,6 +1134,19 @@ config WQ_WATCHDOG
 	  state.  This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config WQ_CPU_INTENSIVE_REPORT
+	bool "Report per-cpu work items which hog CPU for too long"
+	depends on DEBUG_KERNEL
+	help
+	  Say Y here to enable reporting of concurrency-managed per-cpu work
+	  items that hog CPUs for longer than
+	  workqueue.cpu_intensive_threshold_us. Workqueue automatically
+	  detects and excludes them from concurrency management to prevent
+	  them from stalling other per-cpu work items. Occassional
+	  triggering may not necessarily indicate a problem. Repeated
+	  triggering likely indicates that the work item should be switched
+	  to use an unbound workqueue.
+
 config TEST_LOCKUP
 	tristate "Test module to generate lockups"
 	depends on m
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage
  2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
                   ` (5 preceding siblings ...)
  2023-05-11 18:19 ` [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism Tejun Heo
@ 2023-05-11 18:19 ` Tejun Heo
  2023-05-11 21:11   ` Peter Zijlstra
  6 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 18:19 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

Now that wq_worker_tick() is there, we can easily track the rough CPU time
consumption of each workqueue by charging the whole tick whenever a tick
hits an active workqueue. While not super accurate, it provides reasonable
visibility into the workqueues that consume a lot of CPU cycles.
wq_monitor.py is updated to report the per-workqueue CPU times.
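
The sampling scheme can be sketched as follows (illustrative Python, not
kernel code): each tick charges TICK_USEC to the workqueue whose worker the
tick lands on, so accuracy is bounded by tick granularity, but totals
converge for workqueues that actually burn CPU.

```python
TICK_USEC = 10000  # 100Hz tick, in microseconds

def charge_ticks(running_wq_at_each_tick):
    """running_wq_at_each_tick: name of the workqueue whose worker was
    running when each tick fired, or None for non-worker ticks."""
    cpu_time_us = {}
    for wq in running_wq_at_each_tick:
        if wq is not None:  # mirrors the !pwq early return
            cpu_time_us[wq] = cpu_time_us.get(wq, 0) + TICK_USEC
    return cpu_time_us

stats = charge_ticks(["events", "events", None, "events_unbound"])
# stats == {"events": 20000, "events_unbound": 10000}
```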

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/core-api/workqueue.rst | 38 ++++++++++++++--------------
 kernel/workqueue.c                   |  3 +++
 tools/workqueue/wq_monitor.py        |  9 ++++++-
 3 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index a389f31b025c..a4c9b9d1905f 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -354,25 +354,25 @@ Monitoring
 Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
 
   $ tools/workqueue/wq_monitor.py events
-                              total  infl CPUitsv  CMwake  mayday rescued
-  events                      18545     0       0       5       -       -
-  events_highpri                  8     0       0       0       -       -
-  events_long                     3     0       0       0       -       -
-  events_unbound              38306     0       -       -       -       -
-  events_freezable                0     0       0       0       -       -
-  events_power_efficient      29598     0       0       0       -       -
-  events_freezable_power_        10     0       0       0       -       -
-  sock_diag_events                0     0       0       0       -       -
-
-                              total  infl CPUitsv  CMwake  mayday rescued
-  events                      18548     0       0       5       -       -
-  events_highpri                  8     0       0       0       -       -
-  events_long                     3     0       0       0       -       -
-  events_unbound              38322     0       -       -       -       -
-  events_freezable                0     0       0       0       -       -
-  events_power_efficient      29603     0       0       0       -       -
-  events_freezable_power_        10     0       0       0       -       -
-  sock_diag_events                0     0       0       0       -       -
+                              total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+  events                      18545     0      6.1       0       5       -       -
+  events_highpri                  8     0      0.0       0       0       -       -
+  events_long                     3     0      0.0       0       0       -       -
+  events_unbound              38306     0      0.1       -       -       -       -
+  events_freezable                0     0      0.0       0       0       -       -
+  events_power_efficient      29598     0      0.2       0       0       -       -
+  events_freezable_power_        10     0      0.0       0       0       -       -
+  sock_diag_events                0     0      0.0       0       0       -       -
+
+                              total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+  events                      18548     0      6.1       0       5       -       -
+  events_highpri                  8     0      0.0       0       0       -       -
+  events_long                     3     0      0.0       0       0       -       -
+  events_unbound              38322     0      0.1       -       -       -       -
+  events_freezable                0     0      0.0       0       0       -       -
+  events_power_efficient      29603     0      0.2       0       0       -       -
+  events_freezable_power_        10     0      0.0       0       0       -       -
+  sock_diag_events                0     0      0.0       0       0       -       -
 
   ...
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b85d70f133f7..42c41d5368dc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -212,6 +212,7 @@ struct worker_pool {
 enum pool_workqueue_stats {
 	PWQ_STAT_STARTED,	/* work items started execution */
 	PWQ_STAT_COMPLETED,	/* work items completed execution */
+	PWQ_STAT_CPU_TIME,	/* total CPU time consumed */
 	PWQ_STAT_CPU_INTENSIVE,	/* wq_cpu_intensive_thresh_us violations */
 	PWQ_STAT_CM_WAKEUP,	/* concurrency-management worker wakeups */
 	PWQ_STAT_MAYDAY,	/* maydays to rescuer */
@@ -1170,6 +1171,8 @@ void wq_worker_tick(struct task_struct *task)
 	if (!pwq)
 		return;
 
+	pwq->stats[PWQ_STAT_CPU_TIME] += TICK_USEC;
+
 	/*
 	 * If the current worker is concurrency managed and hogged the CPU for
 	 * longer than wq_cpu_intensive_thresh_us, it's automatically marked
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
index 5dc0b61b36e8..3c35239d50c4 100644
--- a/tools/workqueue/wq_monitor.py
+++ b/tools/workqueue/wq_monitor.py
@@ -12,6 +12,10 @@ https://github.com/osandov/drgn.
 
   infl     The number of currently in-flight work items.
 
+  CPUtime  Total CPU time consumed by the workqueue in seconds. This is
+           sampled from scheduler ticks and only provides ballpark
+           measurement. "nohz_full=" CPUs are excluded from measurement.
+
   CPUitsv  The number of times a concurrency-managed work item hogged CPU
            longer than the threshold (workqueue.cpu_intensive_thresh_us)
            and got excluded from concurrency management to avoid stalling
@@ -59,6 +63,7 @@ WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
 
 PWQ_STAT_STARTED        = prog['PWQ_STAT_STARTED']      # work items started execution
 PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']	# work items completed execution
+PWQ_STAT_CPU_TIME       = prog['PWQ_STAT_CPU_TIME']     # total CPU time consumed
 PWQ_STAT_CPU_INTENSIVE  = prog['PWQ_STAT_CPU_INTENSIVE'] # wq_cpu_intensive_thresh_us violations
 PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']    # concurrency-management worker wakeups
 PWQ_STAT_MAYDAY         = prog['PWQ_STAT_MAYDAY']	# maydays to rescuer
@@ -82,13 +87,14 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
                  'mem_reclaim'          : self.mem_reclaim,
                  'started'              : self.stats[PWQ_STAT_STARTED],
                  'completed'            : self.stats[PWQ_STAT_COMPLETED],
+                 'cputime'              : self.stats[PWQ_STAT_CPU_TIME],
                  'cpu_intensive'        : self.stats[PWQ_STAT_CPU_INTENSIVE],
                  'cm_wakeup'            : self.stats[PWQ_STAT_CM_WAKEUP],
                  'mayday'               : self.stats[PWQ_STAT_MAYDAY],
                  'rescued'              : self.stats[PWQ_STAT_RESCUED], }
 
     def table_header_str():
-        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUitsv":>7} '\
+        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} {"CPUitsv":>7} '\
             f'{"CMwake":>7} {"mayday":>7} {"rescued":>7}'
 
     def table_row_str(self):
@@ -108,6 +114,7 @@ PWQ_NR_STATS            = prog['PWQ_NR_STATS']
         out = f'{self.name[-24:]:24} ' \
               f'{self.stats[PWQ_STAT_STARTED]:8} ' \
               f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
+              f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
               f'{cpu_intensive:>7} ' \
               f'{cm_wakeup:>7} ' \
               f'{mayday:>7} ' \
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage
  2023-05-11 18:19 ` [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage Tejun Heo
@ 2023-05-11 21:11   ` Peter Zijlstra
  2023-05-11 23:03     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-05-11 21:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jiangshanlai, torvalds, linux-kernel, kernel-team

On Thu, May 11, 2023 at 08:19:31AM -1000, Tejun Heo wrote:
> Now that wq_worker_tick() is there, we can easily track the rough CPU time
> consumption of each workqueue by charging the whole tick whenever a tick
> hits an active workqueue. While not super accurate, it provides reasonable
> visibility into the workqueues that consume a lot of CPU cycles.
> wq_monitor.py is updated to report the per-workqueue CPU times.

I'm utterly failing to read that dragon thing (or possibly snake, but I
can typically sorta make out what it intends to do).

However, while you don't have preempt-out, you still have sched-out
through wq_worker_sleeping(), so you should be able to compute the time
spend on the workqueue by past worker runs -- fully accurate.

Then you only need to add the time since sched-in of any current worker
and you have a complete picture of time spend on the workqueue, no
approximation needed.

Or am I completely missing the point?
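
The exact accounting described above can be sketched as follows (hypothetical
Python; the names are assumed, not from the patch): bank the elapsed time at
every sched-out and, when reading, add the current worker's time since its
last sched-in.

```python
class WqCpuTime:
    def __init__(self):
        self.banked_ns = 0         # time from completed worker runs
        self.running_since = None  # sched-in timestamp of current worker

    def sched_in(self, now_ns):
        self.running_since = now_ns

    def sched_out(self, now_ns):
        self.banked_ns += now_ns - self.running_since
        self.running_since = None

    def read(self, now_ns):
        """Exact total: banked time plus the in-flight run, if any."""
        in_flight = (now_ns - self.running_since
                     if self.running_since is not None else 0)
        return self.banked_ns + in_flight

t = WqCpuTime()
t.sched_in(100); t.sched_out(150)  # a 50ns run, fully banked
t.sched_in(200)
assert t.read(230) == 80           # 50 banked + 30 still running
```

This simplifies to one worker per workqueue; a real pool would sum the same
bookkeeping over all concurrently running workers.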

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
  2023-05-11 18:19 ` [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE Tejun Heo
@ 2023-05-11 21:23   ` Peter Zijlstra
  2023-05-11 22:47     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-05-11 21:23 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jiangshanlai, torvalds, linux-kernel, kernel-team

On Thu, May 11, 2023 at 08:19:29AM -1000, Tejun Heo wrote:
> If a per-cpu work item hogs the CPU, it can prevent other work items from
> starting through concurrency management. A per-cpu workqueue which intends
> to host such CPU-hogging work items can choose to not participate in
> concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
> error-prone and difficult to debug when missed.
> 
> This patch adds an automatic CPU usage based detection. If a
> concurrency-managed work item consumes more CPU time than the threshold
> (10ms by default) continuously without intervening sleeps, wq_worker_tick()
> which is called from scheduler_tick() will detect the condition and
> automatically mark it CPU_INTENSIVE.
> 
> The mechanism isn't foolproof:
> 
> * Detection depends on tick hitting the work item. Getting preempted at the
>   right timings may allow a violating work item to evade detection at least
>   temporarily.

Right, if you have active tick avoidance in your work items you've got
bigger problems :-)

> * nohz_full CPUs may not be running ticks and thus can fail detection.

We do have sched_tick_remote() for the NOHZ_FULL case; it's all a big
can of tricky but it might just do -- if you really care ofc.

> * Even when detection is working, the 10ms detection delays can add up if
>   many CPU-hogging work items are queued at the same time.

HZ=100 assumption there :-) My HZs are bigger 'n yours etc.

> However, in vast majority of cases, this should be able to detect violations
> reliably and provide reasonable protection with a small increase in code
> complexity.
> 
> If some work items trigger this condition repeatedly, the bigger problem
> likely is the CPU being saturated with such per-cpu work items and the
> solution would be making them UNBOUND. The next patch will add a debug
> mechanism to help spot such cases.
> 
> v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
>     suggested by Peter.
> 
> v2: Lai pointed out that wq_worker_stopping() also needs to be called from
>     preemption and rtlock paths and an earlier patch was updated
> >     accordingly. This patch adds a comment describing the risk of infinite
>     recursions and how they're avoided.

I tend to prefer these changelog-changelogs to go below the --- so that
they go away on applying, they're not really relevant when reading the
patch in a year's time when trying to figure out wtf this patch did.

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> ---

Anyway, this seems entirely reasonable.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
  2023-05-11 18:19 ` [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism Tejun Heo
@ 2023-05-11 21:26   ` Peter Zijlstra
  2023-05-11 22:52     ` Tejun Heo
  2023-05-12 19:42   ` [PATCH v2 " Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-05-11 21:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jiangshanlai, torvalds, linux-kernel, kernel-team

On Thu, May 11, 2023 at 08:19:30AM -1000, Tejun Heo wrote:
> Workqueue now automatically marks per-cpu work items that hog CPU for too
> long as CPU_INTENSIVE, which excludes them from concurrency management and
> prevents stalling other concurrency-managed work items. If a work function
> keeps running over the threshold, it likely needs to be switched to use an
> unbound workqueue.
> 
> This patch adds a debug mechanism which tracks the work functions which
> trigger the automatic CPU_INTENSIVE mechanism and report them using
> pr_warn() with exponential backoff.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>

I did do wonder why you chose for external storage for this -- I figured
it was to keep the cost down since it shouldn't really be happening, so
storage in the normal data structures is a waste etc..?

Otherwise seems fine; thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
  2023-05-11 21:23   ` Peter Zijlstra
@ 2023-05-11 22:47     ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 22:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: jiangshanlai, torvalds, linux-kernel, kernel-team

Hello,

On Thu, May 11, 2023 at 11:23:58PM +0200, Peter Zijlstra wrote:
> > * Even when detection is working, the 10ms detection delays can add up if
> >   many CPU-hogging work items are queued at the same time.
> 
> HZ=100 assumption there :-) My HZs are bigger 'n yours etc.

Oh, I was referring to the default threshold, but yeah, lower HZ can
definitely be a factor.

> > However, in the vast majority of cases, this should be able to detect violations
> > reliably and provide reasonable protection with a small increase in code
> > complexity.
> > 
> > If some work items trigger this condition repeatedly, the bigger problem
> > likely is the CPU being saturated with such per-cpu work items and the
> > solution would be making them UNBOUND. The next patch will add a debug
> > mechanism to help spot such cases.
> > 
> > v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
> >     suggested by Peter.
> > 
> > v2: Lai pointed out that wq_worker_stopping() also needs to be called from
> >     preemption and rtlock paths and an earlier patch was updated
> >     accordingly. This patch adds a comment describing the risk of infinite
> >     recursions and how they're avoided.
> 
> I tend to prefer these changelog-changelogs to go below the --- so that
> they go away on applying, they're not really relevant when reading the
> patch in a year's time when trying to figure out wtf this patch did.

I tried that and promptly lost the version logs while iterating / posting
cuz that involved going through git commits. But yeah, the above version
logs aren't useful. I'll drop them when applying.

> Anyway, this seems entirely reasonable.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
  2023-05-11 21:26   ` Peter Zijlstra
@ 2023-05-11 22:52     ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 22:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: jiangshanlai, torvalds, linux-kernel, kernel-team

Hello,

On Thu, May 11, 2023 at 11:26:06PM +0200, Peter Zijlstra wrote:
> On Thu, May 11, 2023 at 08:19:30AM -1000, Tejun Heo wrote:
> > Workqueue now automatically marks per-cpu work items that hog CPU for too
> > long as CPU_INTENSIVE, which excludes them from concurrency management and
> > prevents stalling other concurrency-managed work items. If a work function
> > keeps running over the threshold, it likely needs to be switched to use an
> > unbound workqueue.
> > 
> > This patch adds a debug mechanism which tracks the work functions which
> > trigger the automatic CPU_INTENSIVE mechanism and reports them using
> > pr_warn() with exponential backoff.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> 
> I did wonder why you chose external storage for this -- I figured
> it was to keep the cost down since it shouldn't really be happening, so
> storage in the normal data structures would be a waste etc.?

The only relevant data structures are workqueue and work_struct. The former
is too coarse because a given workqueue can run any number of different work
items (e.g. system_wq). The latter is too transient to record anything on.
In a lot of cases, the only meaningfully identifiable thing is the work
function pointer, which doesn't have any data structure attached by default,
so...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage
  2023-05-11 21:11   ` Peter Zijlstra
@ 2023-05-11 23:03     ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-11 23:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: jiangshanlai, torvalds, linux-kernel, kernel-team

Hello,

On Thu, May 11, 2023 at 11:11:34PM +0200, Peter Zijlstra wrote:
> On Thu, May 11, 2023 at 08:19:31AM -1000, Tejun Heo wrote:
> > Now that wq_worker_tick() is there, we can easily track the rough CPU time
> > consumption of each workqueue by charging the whole tick whenever a tick
> > hits an active workqueue. While not super accurate, it provides reasonable
> > visibility into the workqueues that consume a lot of CPU cycles.
> > wq_monitor.py is updated to report the per-workqueue CPU times.
> 
> I'm utterly failing to read that dragon thing (or possibly snake, but I
> can typically sorta make out what it intends to do).

Yeah, it's basically gdb in python. It's surprisingly widely useful.

> However, while you don't have preempt-out, you still have sched-out
> through wq_worker_sleeping(), so you should be able to compute the time
> spend on the workqueue by past worker runs -- fully accurate.
> 
> Then you only need to add the time since sched-in of any current worker
> and you have a complete picture of time spend on the workqueue, no
> approximation needed.
> 
> Or am I completely missing the point?

One issue is that sum_exec_runtime isn't necessarily updated when a work
item completes, and the worker can then go on to execute other work items
without any scheduling event in between. So, no matter what I do, if I wanna
account CPU time accurately per work item, I need to read the current time
once more between work items, which is okay but not really necessary, not
preemptively anyway. If there are use cases that need accurate accounting,
we can always revisit the issue.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
  2023-05-11 18:19 ` [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism Tejun Heo
  2023-05-11 21:26   ` Peter Zijlstra
@ 2023-05-12 19:42   ` Tejun Heo
  2023-07-11 13:55     ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Geert Uytterhoeven
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2023-05-12 19:42 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team

Workqueue now automatically marks per-cpu work items that hog CPU for too
long as CPU_INTENSIVE, which excludes them from concurrency management and
prevents stalling other concurrency-managed work items. If a work function
keeps running over the threshold, it likely needs to be switched to use an
unbound workqueue.

This patch adds a debug mechanism which tracks the work functions which
trigger the automatic CPU_INTENSIVE mechanism and reports them using
pr_warn() with exponential backoff.

v2: Drop bouncing through kthread_worker for printing messages. It was to
    avoid introducing circular locking dependency but wasn't effective as it
    still had pool lock -> wci_lock -> printk -> pool lock loop. Let's just
    print directly using printk_deferred().

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/workqueue.c |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug  |   13 +++++++
 2 files changed, 106 insertions(+)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -948,6 +948,98 @@ static inline void worker_clr_flags(stru
 			pool->nr_running++;
 }
 
+#ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
+
+/*
+ * Concurrency-managed per-cpu work items that hog CPU for longer than
+ * wq_cpu_intensive_thresh_us trigger the automatic CPU_INTENSIVE mechanism,
+ * which prevents them from stalling other concurrency-managed work items. If a
+ * work function keeps triggering this mechanism, it's likely that the work item
+ * should be using an unbound workqueue instead.
+ *
+ * wq_cpu_intensive_report() tracks work functions which trigger such conditions
+ * and reports them so that they can be examined and converted to use unbound
+ * workqueues as appropriate. To avoid flooding the console, each violating work
+ * function is tracked and reported with exponential backoff.
+ */
+#define WCI_MAX_ENTS 128
+
+struct wci_ent {
+	work_func_t		func;
+	atomic64_t		cnt;
+	struct hlist_node	hash_node;
+};
+
+static struct wci_ent wci_ents[WCI_MAX_ENTS];
+static int wci_nr_ents;
+static DEFINE_RAW_SPINLOCK(wci_lock);
+static DEFINE_HASHTABLE(wci_hash, ilog2(WCI_MAX_ENTS));
+
+static struct wci_ent *wci_find_ent(work_func_t func)
+{
+	struct wci_ent *ent;
+
+	hash_for_each_possible_rcu(wci_hash, ent, hash_node,
+				   (unsigned long)func) {
+		if (ent->func == func)
+			return ent;
+	}
+	return NULL;
+}
+
+static void wq_cpu_intensive_report(work_func_t func)
+{
+	struct wci_ent *ent;
+
+restart:
+	ent = wci_find_ent(func);
+	if (ent) {
+		u64 cnt;
+
+		/*
+		 * Start reporting from the fourth time and back off
+		 * exponentially.
+		 */
+		cnt = atomic64_inc_return_relaxed(&ent->cnt);
+		if (cnt >= 4 && is_power_of_2(cnt))
+			printk_deferred(KERN_WARNING "workqueue: %ps hogged CPU for >%luus %llu times, consider switching to WQ_UNBOUND\n",
+					ent->func, wq_cpu_intensive_thresh_us,
+					atomic64_read(&ent->cnt));
+		return;
+	}
+
+	/*
+	 * @func is a new violation. Allocate a new entry for it. If wci_ents[]
+	 * is exhausted, something went really wrong and we probably made enough
+	 * noise already.
+	 */
+	if (wci_nr_ents >= WCI_MAX_ENTS)
+		return;
+
+	raw_spin_lock(&wci_lock);
+
+	if (wci_nr_ents >= WCI_MAX_ENTS) {
+		raw_spin_unlock(&wci_lock);
+		return;
+	}
+
+	if (wci_find_ent(func)) {
+		raw_spin_unlock(&wci_lock);
+		goto restart;
+	}
+
+	ent = &wci_ents[wci_nr_ents++];
+	ent->func = func;
+	atomic64_set(&ent->cnt, 1);
+	hash_add_rcu(wci_hash, &ent->hash_node, (unsigned long)func);
+
+	raw_spin_unlock(&wci_lock);
+}
+
+#else	/* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+static void wq_cpu_intensive_report(work_func_t func) {}
+#endif	/* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+
 /**
  * wq_worker_running - a worker is running again
  * @task: task waking up
@@ -1057,6 +1149,7 @@ void wq_worker_tick(struct task_struct *
 	raw_spin_lock(&pool->lock);
 
 	worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+	wq_cpu_intensive_report(worker->current_func);
 	pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
 
 	if (need_more_worker(pool)) {
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1134,6 +1134,19 @@ config WQ_WATCHDOG
 	  state.  This can be configured through kernel parameter
 	  "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config WQ_CPU_INTENSIVE_REPORT
+	bool "Report per-cpu work items which hog CPU for too long"
+	depends on DEBUG_KERNEL
+	help
+	  Say Y here to enable reporting of concurrency-managed per-cpu work
+	  items that hog CPUs for longer than
	  workqueue.cpu_intensive_thresh_us. Workqueue automatically
+	  detects and excludes them from concurrency management to prevent
	  them from stalling other per-cpu work items. Occasional
+	  triggering may not necessarily indicate a problem. Repeated
+	  triggering likely indicates that the work item should be switched
+	  to use an unbound workqueue.
+
 config TEST_LOCKUP
 	tristate "Test module to generate lockups"
 	depends on m

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/7] workqueue: Move worker_set/clr_flags() upwards
  2023-05-18  3:00 [PATCHSET v4 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
@ 2023-05-18  3:00 ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-05-18  3:00 UTC (permalink / raw)
  To: jiangshanlai; +Cc: torvalds, peterz, linux-kernel, kernel-team, Tejun Heo

They are going to be used in wq_worker_stopping(). Move them upwards.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
---
 kernel/workqueue.c | 108 ++++++++++++++++++++++-----------------------
 1 file changed, 54 insertions(+), 54 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 60d5b84cccb2..d70bb5be99ce 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -879,6 +879,60 @@ static void wake_up_worker(struct worker_pool *pool)
 		wake_up_process(worker->task);
 }
 
+/**
+ * worker_set_flags - set worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to set
+ *
+ * Set @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_set_flags(struct worker *worker, unsigned int flags)
+{
+	struct worker_pool *pool = worker->pool;
+
+	WARN_ON_ONCE(worker->task != current);
+
+	/* If transitioning into NOT_RUNNING, adjust nr_running. */
+	if ((flags & WORKER_NOT_RUNNING) &&
+	    !(worker->flags & WORKER_NOT_RUNNING)) {
+		pool->nr_running--;
+	}
+
+	worker->flags |= flags;
+}
+
+/**
+ * worker_clr_flags - clear worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to clear
+ *
+ * Clear @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
+{
+	struct worker_pool *pool = worker->pool;
+	unsigned int oflags = worker->flags;
+
+	WARN_ON_ONCE(worker->task != current);
+
+	worker->flags &= ~flags;
+
+	/*
+	 * If transitioning out of NOT_RUNNING, increment nr_running.  Note
+	 * that the nested NOT_RUNNING is not a noop.  NOT_RUNNING is mask
+	 * of multiple flags, not a single flag.
+	 */
+	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
+		if (!(worker->flags & WORKER_NOT_RUNNING))
+			pool->nr_running++;
+}
+
 /**
  * wq_worker_running - a worker is running again
  * @task: task waking up
@@ -983,60 +1037,6 @@ work_func_t wq_worker_last_func(struct task_struct *task)
 	return worker->last_func;
 }
 
-/**
- * worker_set_flags - set worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to set
- *
- * Set @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_set_flags(struct worker *worker, unsigned int flags)
-{
-	struct worker_pool *pool = worker->pool;
-
-	WARN_ON_ONCE(worker->task != current);
-
-	/* If transitioning into NOT_RUNNING, adjust nr_running. */
-	if ((flags & WORKER_NOT_RUNNING) &&
-	    !(worker->flags & WORKER_NOT_RUNNING)) {
-		pool->nr_running--;
-	}
-
-	worker->flags |= flags;
-}
-
-/**
- * worker_clr_flags - clear worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to clear
- *
- * Clear @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
-{
-	struct worker_pool *pool = worker->pool;
-	unsigned int oflags = worker->flags;
-
-	WARN_ON_ONCE(worker->task != current);
-
-	worker->flags &= ~flags;
-
-	/*
-	 * If transitioning out of NOT_RUNNING, increment nr_running.  Note
-	 * that the nested NOT_RUNNING is not a noop.  NOT_RUNNING is mask
-	 * of multiple flags, not a single flag.
-	 */
-	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
-		if (!(worker->flags & WORKER_NOT_RUNNING))
-			pool->nr_running++;
-}
-
 /**
  * find_worker_executing_work - find worker which is executing a work
  * @pool: pool of interest
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-05-12 19:42   ` [PATCH v2 " Tejun Heo
@ 2023-07-11 13:55     ` Geert Uytterhoeven
  2023-07-11 14:06       ` Geert Uytterhoeven
  0 siblings, 1 reply; 30+ messages in thread
From: Geert Uytterhoeven @ 2023-07-11 13:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hi Tejun,

On Fri, May 12, 2023 at 9:54 PM Tejun Heo <tj@kernel.org> wrote:
> Workqueue now automatically marks per-cpu work items that hog CPU for too
> long as CPU_INTENSIVE, which excludes them from concurrency management and
> prevents stalling other concurrency-managed work items. If a work function
> keeps running over the threshold, it likely needs to be switched to use an
> unbound workqueue.
>
> This patch adds a debug mechanism which tracks the work functions which
> trigger the automatic CPU_INTENSIVE mechanism and reports them using
> pr_warn() with exponential backoff.
>
> v2: Drop bouncing through kthread_worker for printing messages. It was to
>     avoid introducing circular locking dependency but wasn't effective as it
>     still had pool lock -> wci_lock -> printk -> pool lock loop. Let's just
>     print directly using printk_deferred().
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>

Thanks for your patch, which is now commit 6363845005202148
("workqueue: Report work funcs that trigger automatic CPU_INTENSIVE
mechanism") in v6.5-rc1.

I guess you are interested to know where this triggers.
I enabled CONFIG_WQ_CPU_INTENSIVE_REPORT=y, and tested
the result on various machines...

SH/R-Mobile:

  workqueue: genpd_power_off_work_fn hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND

Atmark Techno Armadillo800-EVA with shmob_drm:

  workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 16
times, consider switching to WQ_UNBOUND

R-Car Gen2:

  workqueue: rtc_timer_do_work hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND

R-Car Gen2/Gen3:

  workqueue: pm_runtime_work hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND

R-Car Gen3:

  workqueue: kfree_rcu_work hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND

OrangeCrab/Linux-on-LiteX-VexRiscV with ht16k33 14-seg display and ssd130xdrmfb:

  workqueue: check_lifetime hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND
  workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 1024
times, consider switching to WQ_UNBOUND
  workqueue: fb_flashcursor hogged CPU for >10000us 128 times,
consider switching to WQ_UNBOUND
  workqueue: ht16k33_seg14_update hogged CPU for >10000us 128 times,
consider switching to WQ_UNBOUND
  workqueue: mmc_rescan hogged CPU for >10000us 128 times, consider
switching to WQ_UNBOUND

Atari (ARAnyM):

  workqueue: ata_sff_pio_task hogged CPU for >10000us 64 times,
consider switching to WQ_UNBOUND

The OrangeCrab is a slow machine, so it's not that surprising to see these
messages...

Gr{oetje,eeting}s,

                        Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-11 13:55     ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Geert Uytterhoeven
@ 2023-07-11 14:06       ` Geert Uytterhoeven
  2023-07-11 21:39         ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Geert Uytterhoeven @ 2023-07-11 14:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

On Tue, Jul 11, 2023 at 3:55 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>
> Hi Tejun,
>
> On Fri, May 12, 2023 at 9:54 PM Tejun Heo <tj@kernel.org> wrote:
> > Workqueue now automatically marks per-cpu work items that hog CPU for too
> > long as CPU_INTENSIVE, which excludes them from concurrency management and
> > prevents stalling other concurrency-managed work items. If a work function
> > keeps running over the threshold, it likely needs to be switched to use an
> > unbound workqueue.
> >
> > This patch adds a debug mechanism which tracks the work functions which
> > trigger the automatic CPU_INTENSIVE mechanism and reports them using
> > pr_warn() with exponential backoff.
> >
> > v2: Drop bouncing through kthread_worker for printing messages. It was to
> >     avoid introducing circular locking dependency but wasn't effective as it
> >     still had pool lock -> wci_lock -> printk -> pool lock loop. Let's just
> >     print directly using printk_deferred().
> >
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
>
> Thanks for your patch, which is now commit 6363845005202148
> ("workqueue: Report work funcs that trigger automatic CPU_INTENSIVE
> mechanism") in v6.5-rc1.
>
> I guess you are interested to know where this triggers.
> I enabled CONFIG_WQ_CPU_INTENSIVE_REPORT=y, and tested
> the result on various machines...

> OrangeCrab/Linux-on-LiteX-VexRiscV with ht16k33 14-seg display and ssd130xdrmfb:
>
>   workqueue: check_lifetime hogged CPU for >10000us 4 times, consider
> switching to WQ_UNBOUND
>   workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 1024
> times, consider switching to WQ_UNBOUND
>   workqueue: fb_flashcursor hogged CPU for >10000us 128 times,
> consider switching to WQ_UNBOUND
>   workqueue: ht16k33_seg14_update hogged CPU for >10000us 128 times,
> consider switching to WQ_UNBOUND
>   workqueue: mmc_rescan hogged CPU for >10000us 128 times, consider
> switching to WQ_UNBOUND

Got one more after a while:

workqueue: neigh_managed_work hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-11 14:06       ` Geert Uytterhoeven
@ 2023-07-11 21:39         ` Tejun Heo
  2023-07-12  0:30           ` Tejun Heo
  2023-07-12  8:05           ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2023-07-11 21:39 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hello,

On Tue, Jul 11, 2023 at 04:06:22PM +0200, Geert Uytterhoeven wrote:
> On Tue, Jul 11, 2023 at 3:55 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> >
> > Hi Tejun,
> >
> > On Fri, May 12, 2023 at 9:54 PM Tejun Heo <tj@kernel.org> wrote:
> > > Workqueue now automatically marks per-cpu work items that hog CPU for too
> > > long as CPU_INTENSIVE, which excludes them from concurrency management and
> > > prevents stalling other concurrency-managed work items. If a work function
> > > keeps running over the threshold, it likely needs to be switched to use an
> > > unbound workqueue.
> > >
> > > This patch adds a debug mechanism which tracks the work functions which
> > > trigger the automatic CPU_INTENSIVE mechanism and reports them using
> > > pr_warn() with exponential backoff.
> > >
> > > v2: Drop bouncing through kthread_worker for printing messages. It was to
> > >     avoid introducing circular locking dependency but wasn't effective as it
> > >     still had pool lock -> wci_lock -> printk -> pool lock loop. Let's just
> > >     print directly using printk_deferred().
> > >
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> >
> > Thanks for your patch, which is now commit 6363845005202148
> > ("workqueue: Report work funcs that trigger automatic CPU_INTENSIVE
> > mechanism") in v6.5-rc1.
> >
> > I guess you are interested to know where this triggers.
> > I enabled CONFIG_WQ_CPU_INTENSIVE_REPORT=y, and tested
> > the result on various machines...
> 
> > OrangeCrab/Linux-on-LiteX-VexRiscV with ht16k33 14-seg display and ssd130xdrmfb:
> >
> >   workqueue: check_lifetime hogged CPU for >10000us 4 times, consider
> > switching to WQ_UNBOUND
> >   workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 1024
> > times, consider switching to WQ_UNBOUND
> >   workqueue: fb_flashcursor hogged CPU for >10000us 128 times,
> > consider switching to WQ_UNBOUND
> >   workqueue: ht16k33_seg14_update hogged CPU for >10000us 128 times,
> > consider switching to WQ_UNBOUND
> >   workqueue: mmc_rescan hogged CPU for >10000us 128 times, consider
> > switching to WQ_UNBOUND
> 
> Got one more after a while:
> 
> workqueue: neigh_managed_work hogged CPU for >10000us 4 times,
> consider switching to WQ_UNBOUND

I wonder whether the right thing to do here is somehow scaling the threshold
according to the relative processing power. It's difficult to come up with a
threshold which works well across the latest & fastest and really tiny CPUs.
I'll think about it some more but if you have some ideas, please feel free
to suggest.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-11 21:39         ` Tejun Heo
@ 2023-07-12  0:30           ` Tejun Heo
  2023-07-12  9:57             ` Geert Uytterhoeven
  2023-07-12  8:05           ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2023-07-12  0:30 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

On Tue, Jul 11, 2023 at 11:39:17AM -1000, Tejun Heo wrote:
> On Tue, Jul 11, 2023 at 04:06:22PM +0200, Geert Uytterhoeven wrote:
> > On Tue, Jul 11, 2023 at 3:55 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
...
> > workqueue: neigh_managed_work hogged CPU for >10000us 4 times,
> > consider switching to WQ_UNBOUND
> 
> I wonder whether the right thing to do here is somehow scaling the threshold
> according to the relative processing power. It's difficult to come up with a
> threshold which works well across the latest & fastest and really tiny CPUs.
> I'll think about it some more but if you have some ideas, please feel free
> to suggest.

Geert, do you mind posting the full kernel logs for the affected machines?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-11 21:39         ` Tejun Heo
  2023-07-12  0:30           ` Tejun Heo
@ 2023-07-12  8:05           ` Peter Zijlstra
  2023-07-12  9:04             ` Geert Uytterhoeven
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-07-12  8:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Geert Uytterhoeven, Lai Jiangshan, torvalds@linux-foundation.org,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

On Tue, Jul 11, 2023 at 11:39:17AM -1000, Tejun Heo wrote:

> I wonder whether the right thing to do here is somehow scaling the threshold
> according to the relative processing power. It's difficult to come up with a
> threshold which works well across the latest & fastest and really tiny CPUs.
> I'll think about it some more but if you have some ideas, please feel free
> to suggest.

We could scale by BogoMIPS I suppose, it's a bogus measurement, as per
the name, but it does have some relation to how fast the machine is.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-12  8:05           ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Peter Zijlstra
@ 2023-07-12  9:04             ` Geert Uytterhoeven
  2023-07-12 12:27               ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Geert Uytterhoeven @ 2023-07-12  9:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Lai Jiangshan, torvalds@linux-foundation.org,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hoi Peter,

On Wed, Jul 12, 2023 at 10:05 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 11, 2023 at 11:39:17AM -1000, Tejun Heo wrote:
> > I wonder whether the right thing to do here is somehow scaling the threshold
> > according to the relative processing power. It's difficult to come up with a
> > threshold which works well across the latest & fastest and really tiny CPUs.
> > I'll think about it some more but if you have some ideas, please feel free
> > to suggest.
>
> We could scale by BogoMIPS I suppose, it's a bogus measurement, as per
> the name, but it does have some relation to how fast the machine is.

That's gonna fail miserably on e.g. ARM and RISC-V, where BogoMIPS
depends on some timer frequency.

R-Car M2-W with 1.5 GHz Cortex-A15: 40.00 BogoMIPS
R-Car V4H with 1.8 GHz Cortex-A76: 33.33 BogoMIPS

while the real slow 48 MHz VexRiscV gets 128 BogoMIPS.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-12  0:30           ` Tejun Heo
@ 2023-07-12  9:57             ` Geert Uytterhoeven
  2023-07-17 23:03               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Geert Uytterhoeven @ 2023-07-12  9:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hi Tejun,

On Wed, Jul 12, 2023 at 2:30 AM Tejun Heo <tj@kernel.org> wrote:
> On Tue, Jul 11, 2023 at 11:39:17AM -1000, Tejun Heo wrote:
> > On Tue, Jul 11, 2023 at 04:06:22PM +0200, Geert Uytterhoeven wrote:
> > > On Tue, Jul 11, 2023 at 3:55 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> ...
> > > workqueue: neigh_managed_work hogged CPU for >10000us 4 times,
> > > consider switching to WQ_UNBOUND
> >
> > I wonder whether the right thing to do here is somehow scaling the threshold
> > according to the relative processing power. It's difficult to come up with a
> > threshold which works well across the latest & fastest and really tiny CPUs.
> > I'll think about it some more but if you have some ideas, please feel free
> > to suggest.
>
> Geert, do you mind posting the full kernel logs for the affected machines?

https://drive.google.com/file/d/1toDs7ugZJ2eXatpdvySY4yxSsNam9xAC
is an archive with boot and s2ram logs.  Note that my kernels do contain
local debug code, and may be noisy.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-12  9:04             ` Geert Uytterhoeven
@ 2023-07-12 12:27               ` Peter Zijlstra
  2023-07-13 18:53                 ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-07-12 12:27 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Tejun Heo, Lai Jiangshan, torvalds@linux-foundation.org,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

On Wed, Jul 12, 2023 at 11:04:16AM +0200, Geert Uytterhoeven wrote:
> Hoi Peter,
> 
> On Wed, Jul 12, 2023 at 10:05 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jul 11, 2023 at 11:39:17AM -1000, Tejun Heo wrote:
> > > I wonder whether the right thing to do here is somehow scaling the threshold
> > > according to the relative processing power. It's difficult to come up with a
> > > threshold which works well across the latest & fastest and really tiny CPUs.
> > > I'll think about it some more but if you have some ideas, please feel free
> > > to suggest.
> >
> > We could scale by BogoMIPS I suppose, it's a bogus measurement, as per
> > the name, but it does have some relation to how fast the machine is.
> 
> That's gonna fail miserably on e.g. ARM and RISC-V, where BogoMIPS
> depends on some timer frequency.
> 
> R-Car M2-W with 1.5 GHz Cortex-A15: 40.00 BogoMIPS
> R-Car V4H with 1.8 GHz Cortex-A76: 33.33 BogoMIPS
> 
> while the real slow 48 MHz VexRiscV gets 128 BogoMIPS.

Hehe, OK, really bogus then. Let's file this idea in the bit-bucket.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-12 12:27               ` Peter Zijlstra
@ 2023-07-13 18:53                 ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-07-13 18:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Geert Uytterhoeven, Lai Jiangshan, torvalds@linux-foundation.org,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

On Wed, Jul 12, 2023 at 02:27:45PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 12, 2023 at 11:04:16AM +0200, Geert Uytterhoeven wrote:
> > Hoi Peter,
> > 
> > On Wed, Jul 12, 2023 at 10:05 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Jul 11, 2023 at 11:39:17AM -1000, Tejun Heo wrote:
> > > > I wonder whether the right thing to do here is somehow scaling the threshold
> > > > according to the relative processing power. It's difficult to come up with a
> > > > threshold which works well across the latest & fastest and really tiny CPUs.
> > > > I'll think about it some more but if you have some ideas, please feel free
> > > > to suggest.
> > >
> > > We could scale by BogoMIPS I suppose, it's a bogus measurement, as per
> > > the name, but it does have some relation to how fast the machine is.
> > 
> > That's gonna fail miserably on e.g. ARM and RISC-V, where BogoMIPS
> > depends on some timer frequency.
> > 
> > R-Car M2-W with 1.5 GHz Cortex-A15: 40.00 BogoMIPS
> > R-Car V4H with 1.8 GHz Cortex-A76: 33.33 BogoMIPS
> > 
> > while the real slow 48 MHz VexRiscV gets 128 BogoMIPS.
> 
> Hehe, OK, really bogus then. Lets file this idea in the bit-bucket then.

I think it can still be useful. On a Ryzen 3975WX, it's 6989.92, so while it
may be off by hundreds of percent, there are still orders of magnitude of
signal range, and that should be enough to suppress most spurious warnings.
I'll post something later today.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-12  9:57             ` Geert Uytterhoeven
@ 2023-07-17 23:03               ` Tejun Heo
  2023-07-18  9:54                 ` Geert Uytterhoeven
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2023-07-17 23:03 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hello, Geert.

Can you please try the following patch and see how many reports you get? Looking
back at your reports, I think some of them probably should be converted to
UNBOUND, but we should have a better idea with the adjusted threshold.

Thanks.

From 8555cbd4b22e5f85eb2bdcb84fd1d1f519a0a0d3 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 17 Jul 2023 12:50:02 -1000
Subject: [PATCH] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is
 below 1000

wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.

The default threshold is 10ms, which is long enough to do a fair bit of work
on modern CPUs while short enough to usually not be noticeable. This
unfortunately leads to a lot of arguably spurious detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
orders of magnitude apart doesn't make a whole lot of sense.

This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
is below 1000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up, and the benefits of the reports are usually shared with
everyone regardless of who's reporting, so as long as there is a sufficient
number of fast machines reporting, we don't lose much.

Some (or is it all?) ARM CPUs systematically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
---
 kernel/workqueue.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02a8f402eeb5..0d7a3d29762f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
 #include <linux/sched/debug.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
+#include <linux/delay.h>
 
 #include "workqueue_internal.h"
 
@@ -338,8 +339,10 @@ static cpumask_var_t *wq_numa_possible_cpumask;
  * Per-cpu work items which run for longer than the following threshold are
  * automatically considered CPU intensive and excluded from concurrency
  * management to prevent them from noticeably delaying other per-cpu work items.
+ * ULONG_MAX indicates that the user hasn't overridden it with a boot parameter.
+ * The actual value is initialized in wq_cpu_intensive_thresh_init().
  */
-static unsigned long wq_cpu_intensive_thresh_us = 10000;
+static unsigned long wq_cpu_intensive_thresh_us = ULONG_MAX;
 module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
 
 static bool wq_disable_numa;
@@ -6513,6 +6516,42 @@ void __init workqueue_init_early(void)
 	       !system_freezable_power_efficient_wq);
 }
 
+static void __init wq_cpu_intensive_thresh_init(void)
+{
+	unsigned long thresh;
+	unsigned long mips;
+
+	/* if the user set it to a specific value, keep it */
+	if (wq_cpu_intensive_thresh_us != ULONG_MAX)
+		return;
+
+	/*
+	 * The default of 10ms is derived from the fact that most modern (as of
+	 * 2023) processors can do a lot in 10ms and that it's just below what
+	 * most consider human-perceivable. However, the kernel also runs on a
+	 * lot slower CPUs including microcontrollers where the threshold is way
+	 * too low.
+	 *
+	 * Let's scale up the threshold upto 1 second if BogoMips is below 1000.
+	 * This is by no means accurate but it doesn't have to be. The mechanism
+	 * is still useful even when the threshold is fully scaled up. Also, as
+	 * the reports would usually be applicable to everyone, some machines
+	 * operating on longer thresholds won't significantly diminish their
+	 * usefulness.
+	 */
+	thresh = 10 * USEC_PER_MSEC;
+
+	/* see init/calibrate.c for lpj -> BogoMIPS calculation */
+	mips = max_t(unsigned long, loops_per_jiffy / 500000 * HZ, 1);
+	if (mips < 1000)
+		thresh = min_t(unsigned long, thresh * 1000 / mips, USEC_PER_SEC);
+
+	pr_debug("wq_cpu_intensive_thresh: lpj=%lu mips=%lu thresh_us=%lu\n",
+		 loops_per_jiffy, mips, thresh);
+
+	wq_cpu_intensive_thresh_us = thresh;
+}
+
 /**
  * workqueue_init - bring workqueue subsystem fully online
  *
@@ -6528,6 +6567,8 @@ void __init workqueue_init(void)
 	struct worker_pool *pool;
 	int cpu, bkt;
 
+	wq_cpu_intensive_thresh_init();
+
 	/*
 	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
 	 * CPU to node mapping may not be available that early on some
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-17 23:03               ` Tejun Heo
@ 2023-07-18  9:54                 ` Geert Uytterhoeven
  2023-07-18 22:01                   ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Geert Uytterhoeven @ 2023-07-18  9:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hi Tejun,

On Tue, Jul 18, 2023 at 1:03 AM Tejun Heo <tj@kernel.org> wrote:
> Can you please try the following patch and see how many reports you get? Looking
> back at your reports, I think some of them probably should be converted to
> UNBOUND but we should have a better idea with the adjusted threshold.
>
> Thanks.
>
> From 8555cbd4b22e5f85eb2bdcb84fd1d1f519a0a0d3 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Mon, 17 Jul 2023 12:50:02 -1000
> Subject: [PATCH] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is
>  below 1000
>
> wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
> Once detected, they're excluded from concurrency management to prevent them
> from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
> enabled, repeat offenders are also reported so that the code can be updated.
>
> The default threshold is 10ms, which is long enough to do a fair bit of work
> on modern CPUs while short enough to usually not be noticeable. This
> unfortunately leads to a lot of arguably spurious detections on very slow
> CPUs. Using the same threshold across CPUs whose performance levels may be
> orders of magnitude apart doesn't make a whole lot of sense.
>
> This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
> is below 1000. This is obviously very inaccurate but it doesn't have to be
> accurate to be useful. The mechanism is still useful when the threshold is
> fully scaled up, and the benefits of the reports are usually shared with
> everyone regardless of who's reporting, so as long as there is a sufficient
> number of fast machines reporting, we don't lose much.
>
> Some (or is it all?) ARM CPUs systematically report significantly lower
> BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
> are, it's at least a missed opportunity and it probably would be a good idea
> to teach workqueue about it.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Thanks!

I gave it a try on a system with an 800 MHz Cortex A9, only to discover
it makes no difference, as that machine has 1600 BogoMIPS:

workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND
workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 8 times,
consider switching to WQ_UNBOUND
workqueue: genpd_power_off_work_fn hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND
workqueue: blk_mq_run_work_fn hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND
workqueue: pm_runtime_work hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND
workqueue: phy_state_machine hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND
workqueue: drm_mode_rmfb_work_fn hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND
workqueue: sync_hw_clock hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND
workqueue: rtc_timer_do_work hogged CPU for >10000us 4 times, consider
switching to WQ_UNBOUND

Artificially low BogoMIPS numbers only happen on systems that have
the related timers (Cortex A7/A15 and later, Cortex A9 MPCore,
and arm64).

I will test on more systems, but that will probably not happen until
next week...

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-18  9:54                 ` Geert Uytterhoeven
@ 2023-07-18 22:01                   ` Tejun Heo
  2023-07-25 14:46                     ` Geert Uytterhoeven
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2023-07-18 22:01 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hello,

On Tue, Jul 18, 2023 at 11:54:58AM +0200, Geert Uytterhoeven wrote:
> I gave it a try on a system with an 800 MHz Cortex A9, only to discover
> it makes no difference, as that machine has 1600 BogoMIPS:

Oops.

> workqueue: blk_mq_run_work_fn hogged CPU for >10000us 4 times,
> consider switching to WQ_UNBOUND

It could be that we actually want to switch to UNBOUND for some of the
reports, but the above triggering most likely indicates that the threshold is
too aggressive.

> Artificially low BogoMIPS numbers only happen on systems that have
> the related timers (Cortex A7/A15 and later, Cortex A9 MPCore,
> and arm64).

Ah, I see. Thanks for the explanation.

> I will test on more systems, but that will probably not happen until
> next week...

Thanks, really appreciate it. Can you try the following instead when you
have time? I just pushed up the lower boundary to 4000 MIPS. The scaling is
still capped at 1s.

From 8555cbd4b22e5f85eb2bdcb84fd1d1f519a0a0d3 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 17 Jul 2023 12:50:02 -1000
Subject: [PATCH] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is
 below 4000

wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.

The default threshold is 10ms, which is long enough to do a fair bit of work
on modern CPUs while short enough to usually not be noticeable. This
unfortunately leads to a lot of arguably spurious detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
orders of magnitude apart doesn't make a whole lot of sense.

This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up, and the benefits of the reports are usually shared with
everyone regardless of who's reporting, so as long as there is a sufficient
number of fast machines reporting, we don't lose much.

Some (or is it all?) ARM CPUs systematically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
---
 kernel/workqueue.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02a8f402eeb5..0d7a3d29762f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
 #include <linux/sched/debug.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
+#include <linux/delay.h>
 
 #include "workqueue_internal.h"
 
@@ -338,8 +339,10 @@ static cpumask_var_t *wq_numa_possible_cpumask;
  * Per-cpu work items which run for longer than the following threshold are
  * automatically considered CPU intensive and excluded from concurrency
  * management to prevent them from noticeably delaying other per-cpu work items.
+ * ULONG_MAX indicates that the user hasn't overridden it with a boot parameter.
+ * The actual value is initialized in wq_cpu_intensive_thresh_init().
  */
-static unsigned long wq_cpu_intensive_thresh_us = 10000;
+static unsigned long wq_cpu_intensive_thresh_us = ULONG_MAX;
 module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
 
 static bool wq_disable_numa;
@@ -6513,6 +6516,42 @@ void __init workqueue_init_early(void)
 	       !system_freezable_power_efficient_wq);
 }
 
+static void __init wq_cpu_intensive_thresh_init(void)
+{
+	unsigned long thresh;
+	unsigned long mips;
+
+	/* if the user set it to a specific value, keep it */
+	if (wq_cpu_intensive_thresh_us != ULONG_MAX)
+		return;
+
+	/*
+	 * The default of 10ms is derived from the fact that most modern (as of
+	 * 2023) processors can do a lot in 10ms and that it's just below what
+	 * most consider human-perceivable. However, the kernel also runs on a
+	 * lot slower CPUs including microcontrollers where the threshold is way
+	 * too low.
+	 *
+	 * Let's scale up the threshold upto 1 second if BogoMips is below 4000.
+	 * This is by no means accurate but it doesn't have to be. The mechanism
+	 * is still useful even when the threshold is fully scaled up. Also, as
+	 * the reports would usually be applicable to everyone, some machines
+	 * operating on longer thresholds won't significantly diminish their
+	 * usefulness.
+	 */
+	thresh = 10 * USEC_PER_MSEC;
+
+	/* see init/calibrate.c for lpj -> BogoMIPS calculation */
+	mips = max_t(unsigned long, loops_per_jiffy / 500000 * HZ, 1);
+	if (mips < 4000)
+		thresh = min_t(unsigned long, thresh * 4000 / mips, USEC_PER_SEC);
+
+	pr_debug("wq_cpu_intensive_thresh: lpj=%lu mips=%lu thresh_us=%lu\n",
+		 loops_per_jiffy, mips, thresh);
+
+	wq_cpu_intensive_thresh_us = thresh;
+}
+
 /**
  * workqueue_init - bring workqueue subsystem fully online
  *
@@ -6528,6 +6567,8 @@ void __init workqueue_init(void)
 	struct worker_pool *pool;
 	int cpu, bkt;
 
+	wq_cpu_intensive_thresh_init();
+
 	/*
 	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
 	 * CPU to node mapping may not be available that early on some
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism)
  2023-07-18 22:01                   ` Tejun Heo
@ 2023-07-25 14:46                     ` Geert Uytterhoeven
  2023-07-25 21:52                       ` [PATCH wq/for-6.5-fixes] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000 Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Geert Uytterhoeven @ 2023-07-25 14:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

Hi Tejun,

On Wed, Jul 19, 2023 at 12:01 AM Tejun Heo <tj@kernel.org> wrote:
> On Tue, Jul 18, 2023 at 11:54:58AM +0200, Geert Uytterhoeven wrote:
> > I gave it a try on a system with an 800 MHz Cortex A9, only to discover
> > it makes no difference, as that machine has 1600 BogoMIPS:
>
> Oops.
>
> > workqueue: blk_mq_run_work_fn hogged CPU for >10000us 4 times,
> > consider switching to WQ_UNBOUND
>
> It could be that we actually want to switch to UNBOUND for some reports but
> the above triggering most likely indicates that the threshold is too
> aggressive.
>
> > Artificially low BogoMIPS numbers only happen on systems that have
> > the related timers (Cortex A7/A15 and later, Cortex A9 MPCore,
> > and arm64).
>
> Ah, I see. Thanks for the explanation.
>
> > I will test on more systems, but that will probably not happen until
> > next week...
>
> Thanks, really appreciate it. Can you try the following instead when you
> have time? I just pushed up the lower boundary to 4000 MIPS. The scaling is
> still capped at 1s.

Thanks, with the below, I see no more WQ_UNBOUND messages.

> From 8555cbd4b22e5f85eb2bdcb84fd1d1f519a0a0d3 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Mon, 17 Jul 2023 12:50:02 -1000
> Subject: [PATCH] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is
>  below 4000

> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c

> @@ -6513,6 +6516,42 @@ void __init workqueue_init_early(void)
>                !system_freezable_power_efficient_wq);
>  }
>
> +static void __init wq_cpu_intensive_thresh_init(void)
> +{
> +       unsigned long thresh;
> +       unsigned long mips;

This fails to build on mips.
Apparently mips is a predefined preprocessor macro:

$ echo | mipsel-linux-gnu-gcc -dM -E - | grep -w mips
#define mips 1

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH wq/for-6.5-fixes] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000
  2023-07-25 14:46                     ` Geert Uytterhoeven
@ 2023-07-25 21:52                       ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2023-07-25 21:52 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Lai Jiangshan, torvalds@linux-foundation.org, Peter Zijlstra,
	Linux Kernel Mailing List, kernel-team, Linux PM list,
	DRI Development, linux-rtc, linux-riscv, netdev,
	Linux Fbdev development list, Linux MMC List,
	open list:LIBATA SUBSYSTEM (Serial and Parallel ATA drivers),
	Linux-Renesas

From aa6fde93f3a49e42c0fe0490d7f3711bac0d162e Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 17 Jul 2023 12:50:02 -1000
Subject: [PATCH] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is
 below 4000

wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.

The default threshold is 10ms, which is long enough to do a fair bit of work
on modern CPUs while short enough to usually not be noticeable. This
unfortunately leads to a lot of arguably spurious detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
orders of magnitude apart doesn't make a whole lot of sense.

This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up, and the benefits of the reports are usually shared with
everyone regardless of who's reporting, so as long as there is a sufficient
number of fast machines reporting, we don't lose much.

Some (or is it all?) ARM CPUs systematically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
---
Hello,

Geert, applied this to wq/for-6.5-fixes. The local variable `mips` is
renamed to `bogo` to avoid collision.

Thanks for testing!

 kernel/workqueue.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02a8f402eeb5..800b4208dba9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -52,6 +52,7 @@
 #include <linux/sched/debug.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
+#include <linux/delay.h>
 
 #include "workqueue_internal.h"
 
@@ -338,8 +339,10 @@ static cpumask_var_t *wq_numa_possible_cpumask;
  * Per-cpu work items which run for longer than the following threshold are
  * automatically considered CPU intensive and excluded from concurrency
  * management to prevent them from noticeably delaying other per-cpu work items.
+ * ULONG_MAX indicates that the user hasn't overridden it with a boot parameter.
+ * The actual value is initialized in wq_cpu_intensive_thresh_init().
  */
-static unsigned long wq_cpu_intensive_thresh_us = 10000;
+static unsigned long wq_cpu_intensive_thresh_us = ULONG_MAX;
 module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
 
 static bool wq_disable_numa;
@@ -6513,6 +6516,42 @@ void __init workqueue_init_early(void)
 	       !system_freezable_power_efficient_wq);
 }
 
+static void __init wq_cpu_intensive_thresh_init(void)
+{
+	unsigned long thresh;
+	unsigned long bogo;
+
+	/* if the user set it to a specific value, keep it */
+	if (wq_cpu_intensive_thresh_us != ULONG_MAX)
+		return;
+
+	/*
+	 * The default of 10ms is derived from the fact that most modern (as of
+	 * 2023) processors can do a lot in 10ms and that it's just below what
+	 * most consider human-perceivable. However, the kernel also runs on a
+	 * lot slower CPUs including microcontrollers where the threshold is way
+	 * too low.
+	 *
+	 * Let's scale up the threshold upto 1 second if BogoMips is below 4000.
+	 * This is by no means accurate but it doesn't have to be. The mechanism
+	 * is still useful even when the threshold is fully scaled up. Also, as
+	 * the reports would usually be applicable to everyone, some machines
+	 * operating on longer thresholds won't significantly diminish their
+	 * usefulness.
+	 */
+	thresh = 10 * USEC_PER_MSEC;
+
+	/* see init/calibrate.c for lpj -> BogoMIPS calculation */
+	bogo = max_t(unsigned long, loops_per_jiffy / 500000 * HZ, 1);
+	if (bogo < 4000)
+		thresh = min_t(unsigned long, thresh * 4000 / bogo, USEC_PER_SEC);
+
+	pr_debug("wq_cpu_intensive_thresh: lpj=%lu BogoMIPS=%lu thresh_us=%lu\n",
+		 loops_per_jiffy, bogo, thresh);
+
+	wq_cpu_intensive_thresh_us = thresh;
+}
+
 /**
  * workqueue_init - bring workqueue subsystem fully online
  *
@@ -6528,6 +6567,8 @@ void __init workqueue_init(void)
 	struct worker_pool *pool;
 	int cpu, bkt;
 
+	wq_cpu_intensive_thresh_init();
+
 	/*
 	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
 	 * CPU to node mapping may not be available that early on some
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-07-25 21:52 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-11 18:19 [PATCHSET v3 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
2023-05-11 18:19 ` [PATCH 1/7] workqueue: Add pwq->stats[] and a monitoring script Tejun Heo
2023-05-11 18:19 ` [PATCH 2/7] workqueue: Re-order struct worker fields Tejun Heo
2023-05-11 18:19 ` [PATCH 3/7] workqueue: Move worker_set/clr_flags() upwards Tejun Heo
2023-05-11 18:19 ` [PATCH 4/7] workqueue: Improve locking rule description for worker fields Tejun Heo
2023-05-11 18:19 ` [PATCH 5/7] workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE Tejun Heo
2023-05-11 21:23   ` Peter Zijlstra
2023-05-11 22:47     ` Tejun Heo
2023-05-11 18:19 ` [PATCH 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism Tejun Heo
2023-05-11 21:26   ` Peter Zijlstra
2023-05-11 22:52     ` Tejun Heo
2023-05-12 19:42   ` [PATCH v2 " Tejun Heo
2023-07-11 13:55     ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Geert Uytterhoeven
2023-07-11 14:06       ` Geert Uytterhoeven
2023-07-11 21:39         ` Tejun Heo
2023-07-12  0:30           ` Tejun Heo
2023-07-12  9:57             ` Geert Uytterhoeven
2023-07-17 23:03               ` Tejun Heo
2023-07-18  9:54                 ` Geert Uytterhoeven
2023-07-18 22:01                   ` Tejun Heo
2023-07-25 14:46                     ` Geert Uytterhoeven
2023-07-25 21:52                       ` [PATCH wq/for-6.5-fixes] workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000 Tejun Heo
2023-07-12  8:05           ` Consider switching to WQ_UNBOUND messages (was: Re: [PATCH v2 6/7] workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism) Peter Zijlstra
2023-07-12  9:04             ` Geert Uytterhoeven
2023-07-12 12:27               ` Peter Zijlstra
2023-07-13 18:53                 ` Tejun Heo
2023-05-11 18:19 ` [PATCH 7/7] workqueue: Track and monitor per-workqueue CPU time usage Tejun Heo
2023-05-11 21:11   ` Peter Zijlstra
2023-05-11 23:03     ` Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2023-05-18  3:00 [PATCHSET v4 wq/for-6.5] workqueue: Implement automatic CPU intensive detection and add monitoring Tejun Heo
2023-05-18  3:00 ` [PATCH 3/7] workqueue: Move worker_set/clr_flags() upwards Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox