From: Petr Mladek <pmladek@suse.com>
To: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
Michal Koutny <mkoutny@suse.com>,
linux-kernel@vger.kernel.org, Petr Mladek <pmladek@suse.com>
Subject: [PATCH v2 2/5] workqueue: Warn when a new worker could not be created
Date: Tue, 7 Mar 2023 13:53:32 +0100
Message-ID: <20230307125335.28805-3-pmladek@suse.com>
In-Reply-To: <20230307125335.28805-1-pmladek@suse.com>
The workqueue watchdog reports a lockup when there has not been any
progress in the worker pool for a long time. Progress means that a pending
work item starts being processed.
Progress is guaranteed by using idle workers or creating new workers
for pending work items.
There are several reasons why a new worker could not be created:
+ there is not enough memory
+ there is no free pool ID (IDR API)
+ the system reached the PID limit
+ the process creating the new worker was interrupted
+ the last idle worker (manager) has not been scheduled for a long
  time. It was not able to even start creating the kthread.
None of these failures is reported at the moment. The only clue is that
show_one_worker_pool() prints that there is a manager, that is, the last
idle worker, which is responsible for creating a new one. But it is not
clear whether create_worker() is failing, or why.
Make the debugging easier by printing errors in create_worker().
The error code is important, especially from kthread_create_on_node().
It helps to distinguish the various reasons, for example, reaching the
memory limit (-ENOMEM), other system limits (-EAGAIN), or the process
being interrupted (-EINTR).
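As an illustration only, the following userspace sketch shows how an
error-encoding pointer, like the one returned by kthread_create_on_node(),
carries the errno that distinguishes these cases. The ERR_PTR(), PTR_ERR()
and IS_ERR() definitions here are simplified stand-ins for the kernel
helpers, and create_failure_reason() is a hypothetical helper that is not
part of this patch.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the last MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical helper: map an error code to the reasons listed above. */
static const char *create_failure_reason(long err)
{
	switch (err) {
	case -ENOMEM:
		return "memory limit reached";
	case -EAGAIN:
		return "other system limit reached";
	case -EINTR:
		return "process interrupted";
	default:
		return "unknown failure";
	}
}
```

A caller would check IS_ERR() on the returned pointer first and only then
pass PTR_ERR() to create_failure_reason().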
Use pr_err_once() to avoid repeating the same error every CREATE_COOLDOWN
interval for each stuck worker pool.
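The print-once behavior can be sketched in userspace as follows. This is
a simplified analogue, not the kernel macro: report_once(), printed_count
and try_create_worker() are hypothetical names, and the sketch is not
thread-safe.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted messages so the behavior can be checked. */
static int printed_count;

/*
 * Each expansion of the macro gets its own static flag, so a given
 * call site prints at most once no matter how often it is reached.
 */
#define report_once(...)				\
	do {						\
		static bool printed_;			\
		if (!printed_) {			\
			printed_ = true;		\
			printed_count++;		\
			fprintf(stderr, __VA_ARGS__);	\
		}					\
	} while (0)

/* Hypothetical caller that is retried while worker creation keeps failing. */
static void try_create_worker(int err)
{
	if (err)
		report_once("workqueue: Failed to create a worker: %d\n", err);
}
```

Calling try_create_worker() repeatedly with a failing error code prints
the message only on the first failure, even if the error code changes
later. That is the trade-off discussed next.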
A ratelimited printk() might be better. It would help to know if the problem
persists. It would be clearer whether the create_worker() errors and workqueue
stalls are related. Also, old messages might get lost when the internal log
buffer is full. The problem is that printk() might touch the watchdog.
For example, see touch_nmi_watchdog() in serial8250_console_write().
It would require synchronizing the start and length of the ratelimit
interval with the workqueue watchdog. Otherwise, the error messages
might break the watchdog. This does not look worth the complexity.
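For comparison, a ratelimited message is usually driven by an interval
and a burst count. The userspace sketch below only illustrates that idea
under stated assumptions: struct msg_ratelimit and msg_ratelimit_allow()
are hypothetical names, not the kernel's ratelimit API, and the caller
passes the current time explicitly.

```c
#include <stdbool.h>

struct msg_ratelimit {
	long interval;	/* length of one interval, in arbitrary time units */
	int burst;	/* messages allowed per interval */
	long begin;	/* start of the current interval */
	int printed;	/* messages already allowed in this interval */
};

/* Return true when a message may be printed at time "now". */
static bool msg_ratelimit_allow(struct msg_ratelimit *rs, long now)
{
	/* Open a new interval once the current one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return false;
	rs->printed++;
	return true;
}
```

The begin and interval fields are the state that would have to be kept
in step with the workqueue watchdog, which is the complexity this patch
avoids.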
Signed-off-by: Petr Mladek <pmladek@suse.com>
---
kernel/workqueue.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2be9b0ecf22c..36ad9a4d65e4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1938,12 +1938,17 @@ static struct worker *create_worker(struct worker_pool *pool)
 	/* ID is needed to determine kthread name */
 	id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
-	if (id < 0)
+	if (id < 0) {
+		pr_err_once("workqueue: Failed to allocate a worker ID: %pe\n",
+			    ERR_PTR(id));
 		return NULL;
+	}
 	worker = alloc_worker(pool->node);
-	if (!worker)
+	if (!worker) {
+		pr_err_once("workqueue: Failed to allocate a worker\n");
 		goto fail;
+	}
 	worker->id = id;
@@ -1955,8 +1960,11 @@ static struct worker *create_worker(struct worker_pool *pool)
 	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
 					      "kworker/%s", id_buf);
-	if (IS_ERR(worker->task))
+	if (IS_ERR(worker->task)) {
+		pr_err_once("workqueue: Failed to create a worker thread: %pe\n",
+			    worker->task);
 		goto fail;
+	}
 	set_user_nice(worker->task, pool->attrs->nice);
 	kthread_bind_mask(worker->task, pool->attrs->cpumask);
--
2.35.3