All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stefan Metzmacher <metze@samba.org>,
	io-uring <io-uring@vger.kernel.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 09/18] io-wq: fork worker threads from original task
Date: Thu, 4 Mar 2021 13:23:32 -0700	[thread overview]
Message-ID: <d0eb6092-cbc9-dd37-e49f-6ff8c6b8b136@kernel.dk> (raw)
In-Reply-To: <7dc54165-ac8a-ab3b-c03d-9e696b8a577e@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 2037 bytes --]

On 3/4/21 1:00 PM, Jens Axboe wrote:
> On 3/4/21 12:54 PM, Jens Axboe wrote:
>> On 3/4/21 12:46 PM, Linus Torvalds wrote:
>>> On Thu, Mar 4, 2021 at 11:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> Took a quick look at this, and I agree that's _much_ better. In fact, it
>>>> boils down to just calling copy_process() and then having the caller do
>>>> wake_up_new_task(). So not sure if it's worth adding an
>>>> create_io_thread() helper, or just make copy_process() available
>>>> instead. This is ignoring the trace point for now...
>>>
>>> I really don't want to expose copy_process() outside of kernel/fork.c.
>>>
>>> The whole three-phase "copy - setup - activate" model is a really
>>> really good thing, and it's how we've done things internally almost
>>> forever, but I really don't want to expose those middle stages to any
>>> outsiders.
>>>
>>> So I'd really prefer a simple new "create_io_worker()", even if it's
>>> literally just some four-line function that does
>>>
>>>    p = copy_process(..);
>>>    if (!IS_ERR(p)) {
>>>       block all signals in p
>>>       set PF_IO_WORKER flag
>>>       wake_up_new_task(p);
>>>    }
>>>    return p;
>>>
>>> I very much want that to be inside kernel/fork.c and have all these
>>> rules about creating new threads localized there.
>>
>> I agree, here are the two current patches. Just need to add the signal
>> blocking, which I'd love to do in create_io_thread(), but seems to
>> require either an allocation or provide a helper to do it in the thread
>> itself (with an on-stack mask).
> 
> Nevermind, it's actually copied, so we can do it in create_io_thread().
> I know you'd prefer not to expose the 'task created but not active' state,
> but:
> 
> 1) That allows us to do further setup in the creator and hence eliminate
>    wait+complete for that
> 
> 2) It's not exported, so not available to drivers etc.
> 

Here's a version that includes the signal blocking too, inside the
create_io_thread() helper. I'll run this through the usual testing.

-- 
Jens Axboe


[-- Attachment #2: 0001-kernel-provide-create_io_thread-helper.patch --]
[-- Type: text/x-patch, Size: 2906 bytes --]

From 81910fbd73e7eecea2827c407dbcaab49085c5e3 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 4 Mar 2021 12:21:05 -0700
Subject: [PATCH 1/2] kernel: provide create_io_thread() helper

Provide a generic helper for setting up an io_uring worker. Returns a
task_struct so that the caller can do whatever setup is needed, then call
wake_up_new_task() to kick it into gear.

Add a kernel_clone_args member, io_thread, which tells copy_process() to
mark the task with PF_IO_WORKER.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/sched/task.h |  2 ++
 kernel/fork.c              | 28 ++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index c0f71f2e7160..ef02be869cf2 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
+	int io_thread;
 	struct cgroup *cgrp;
 	struct css_set *cset;
 };
@@ -82,6 +83,7 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
+struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..08708865c58f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1940,6 +1940,8 @@ static __latent_entropy struct task_struct *copy_process(
 	p = dup_task_struct(current, node);
 	if (!p)
 		goto fork_out;
+	if (args->io_thread)
+		p->flags |= PF_IO_WORKER;
 
 	/*
 	 * This _must_ happen before we call free_task(), i.e. before we jump
@@ -2410,6 +2412,32 @@ struct mm_struct *copy_init_mm(void)
 	return dup_mm(NULL, &init_mm);
 }
 
+/*
+ * This is like kernel_clone(), but shaved down and tailored to just
+ * creating io_uring workers. It returns a created task, or an error pointer.
+ * The returned task is inactive, and the caller must fire it up through
+ * wake_up_new_task(p). All signals are blocked in the created task.
+ */
+struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
+{
+	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
+				CLONE_IO|SIGCHLD;
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(flags) | CLONE_VM |
+				    CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
+		.stack		= (unsigned long)fn,
+		.stack_size	= (unsigned long)arg,
+		.io_thread	= 1,
+	};
+	struct task_struct *tsk;
+
+	tsk = copy_process(NULL, 0, node, &args);
+	if (!IS_ERR(tsk))
+		sigfillset(&tsk->blocked);
+	return tsk;
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.30.1


[-- Attachment #3: 0002-io_uring-move-to-using-create_io_thread.patch --]
[-- Type: text/x-patch, Size: 8493 bytes --]

From f11913b472cdc46082466dbb6cd56f105e5dcdd7 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Thu, 4 Mar 2021 12:39:36 -0700
Subject: [PATCH 2/2] io_uring: move to using create_io_thread()

This allows us to do task creation and setup without needing to use
completions to try and synchronize with the starting thread. Get rid of
the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup
completion events - we can now do setup before the task is running.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    | 69 ++++++++++++++-------------------------------------
 fs/io-wq.h    |  2 --
 fs/io_uring.c | 36 +++++++++++++--------------
 3 files changed, 35 insertions(+), 72 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 19f18389ead2..cee41b81747c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -54,7 +54,6 @@ struct io_worker {
 	spinlock_t lock;
 
 	struct completion ref_done;
-	struct completion started;
 
 	struct rcu_head rcu;
 };
@@ -116,7 +115,6 @@ struct io_wq {
 	struct io_wq_hash *hash;
 
 	refcount_t refs;
-	struct completion started;
 	struct completion exited;
 
 	atomic_t worker_refs;
@@ -273,14 +271,6 @@ static void io_wqe_dec_running(struct io_worker *worker)
 		io_wqe_wake_worker(wqe, acct);
 }
 
-static void io_worker_start(struct io_worker *worker)
-{
-	current->flags |= PF_NOFREEZE;
-	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
-	io_wqe_inc_running(worker);
-	complete(&worker->started);
-}
-
 /*
  * Worker will start processing some work. Move it to the busy list, if
  * it's currently on the freelist
@@ -490,8 +480,6 @@ static int io_wqe_worker(void *data)
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 
-	io_worker_start(worker);
-
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
@@ -576,12 +564,6 @@ static int task_thread(void *data, int index)
 	sprintf(buf, "iou-wrk-%d", wq->task_pid);
 	set_task_comm(current, buf);
 
-	current->pf_io_worker = worker;
-	worker->task = current;
-
-	set_cpus_allowed_ptr(current, cpumask_of_node(wqe->node));
-	current->flags |= PF_NO_SETAFFINITY;
-
 	raw_spin_lock_irq(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
 	list_add_tail_rcu(&worker->all_list, &wqe->all_list);
@@ -607,25 +589,10 @@ static int task_thread_unbound(void *data)
 	return task_thread(data, IO_WQ_ACCT_UNBOUND);
 }
 
-pid_t io_wq_fork_thread(int (*fn)(void *), void *arg)
-{
-	unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-				CLONE_IO|SIGCHLD;
-	struct kernel_clone_args args = {
-		.flags		= ((lower_32_bits(flags) | CLONE_VM |
-				    CLONE_UNTRACED) & ~CSIGNAL),
-		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
-		.stack		= (unsigned long)fn,
-		.stack_size	= (unsigned long)arg,
-	};
-
-	return kernel_clone(&args);
-}
-
 static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 {
 	struct io_worker *worker;
-	pid_t pid;
+	struct task_struct *tsk;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -638,21 +605,26 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 	worker->wqe = wqe;
 	spin_lock_init(&worker->lock);
 	init_completion(&worker->ref_done);
-	init_completion(&worker->started);
 
 	atomic_inc(&wq->worker_refs);
 
 	if (index == IO_WQ_ACCT_BOUND)
-		pid = io_wq_fork_thread(task_thread_bound, worker);
+		tsk = create_io_thread(task_thread_bound, worker, wqe->node);
 	else
-		pid = io_wq_fork_thread(task_thread_unbound, worker);
-	if (pid < 0) {
+		tsk = create_io_thread(task_thread_unbound, worker, wqe->node);
+	if (IS_ERR(tsk)) {
 		if (atomic_dec_and_test(&wq->worker_refs))
 			complete(&wq->worker_done);
 		kfree(worker);
 		return false;
 	}
-	wait_for_completion(&worker->started);
+	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
+	io_wqe_inc_running(worker);
+	tsk->pf_io_worker = worker;
+	worker->task = tsk;
+	set_cpus_allowed_ptr(tsk, cpumask_of_node(wqe->node));
+	tsk->flags |= PF_NOFREEZE | PF_NO_SETAFFINITY;
+	wake_up_new_task(tsk);
 	return true;
 }
 
@@ -696,6 +668,7 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
 
 static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 {
+	set_notify_signal(worker->task);
 	wake_up_process(worker->task);
 	return false;
 }
@@ -752,10 +725,6 @@ static int io_wq_manager(void *data)
 
 	sprintf(buf, "iou-mgr-%d", wq->task_pid);
 	set_task_comm(current, buf);
-	current->flags |= PF_IO_WORKER;
-	wq->manager = get_task_struct(current);
-
-	complete(&wq->started);
 
 	do {
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -815,21 +784,20 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 
 static int io_wq_fork_manager(struct io_wq *wq)
 {
-	int ret;
+	struct task_struct *tsk;
 
 	if (wq->manager)
 		return 0;
 
 	reinit_completion(&wq->worker_done);
-	current->flags |= PF_IO_WORKER;
-	ret = io_wq_fork_thread(io_wq_manager, wq);
-	current->flags &= ~PF_IO_WORKER;
-	if (ret >= 0) {
-		wait_for_completion(&wq->started);
+	tsk = create_io_thread(io_wq_manager, wq, NUMA_NO_NODE);
+	if (!IS_ERR(tsk)) {
+		wq->manager = get_task_struct(tsk);
+		wake_up_new_task(tsk);
 		return 0;
 	}
 
-	return ret;
+	return PTR_ERR(tsk);
 }
 
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
@@ -1062,7 +1030,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	}
 
 	wq->task_pid = current->pid;
-	init_completion(&wq->started);
 	init_completion(&wq->exited);
 	refcount_set(&wq->refs, 1);
 
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 42f0be64a84d..5fbf7997149e 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -119,8 +119,6 @@ void io_wq_put_and_exit(struct io_wq *wq);
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
 
-pid_t io_wq_fork_thread(int (*fn)(void *), void *arg);
-
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
 	return work->flags & IO_WQ_WORK_HASHED;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e55369555e5c..d885fbd53bbc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6668,7 +6668,6 @@ static int io_sq_thread(void *data)
 
 	sprintf(buf, "iou-sqp-%d", sqd->task_pid);
 	set_task_comm(current, buf);
-	sqd->thread = current;
 	current->pf_io_worker = NULL;
 
 	if (sqd->sq_cpu != -1)
@@ -6677,8 +6676,6 @@ static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpu_online_mask);
 	current->flags |= PF_NO_SETAFFINITY;
 
-	complete(&sqd->completion);
-
 	wait_for_completion(&sqd->startup);
 
 	while (!io_sq_thread_should_stop(sqd)) {
@@ -7818,21 +7815,22 @@ void __io_uring_free(struct task_struct *tsk)
 
 static int io_sq_thread_fork(struct io_sq_data *sqd, struct io_ring_ctx *ctx)
 {
+	struct task_struct *tsk;
 	int ret;
 
 	clear_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
 	reinit_completion(&sqd->completion);
 	ctx->sqo_exec = 0;
 	sqd->task_pid = current->pid;
-	current->flags |= PF_IO_WORKER;
-	ret = io_wq_fork_thread(io_sq_thread, sqd);
-	current->flags &= ~PF_IO_WORKER;
-	if (ret < 0) {
-		sqd->thread = NULL;
+	tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+	if (IS_ERR(tsk))
 		return ret;
-	}
-	wait_for_completion(&sqd->completion);
-	return io_uring_alloc_task_context(sqd->thread, ctx);
+	ret = io_uring_alloc_task_context(tsk, ctx);
+	if (ret)
+		set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+	sqd->thread = tsk;
+	wake_up_new_task(tsk);
+	return ret;
 }
 
 static int io_sq_offload_create(struct io_ring_ctx *ctx,
@@ -7855,6 +7853,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		fdput(f);
 	}
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		struct task_struct *tsk;
 		struct io_sq_data *sqd;
 
 		ret = -EPERM;
@@ -7896,15 +7895,14 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		}
 
 		sqd->task_pid = current->pid;
-		current->flags |= PF_IO_WORKER;
-		ret = io_wq_fork_thread(io_sq_thread, sqd);
-		current->flags &= ~PF_IO_WORKER;
-		if (ret < 0) {
-			sqd->thread = NULL;
+		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+		if (IS_ERR(tsk))
 			goto err;
-		}
-		wait_for_completion(&sqd->completion);
-		ret = io_uring_alloc_task_context(sqd->thread, ctx);
+		ret = io_uring_alloc_task_context(tsk, ctx);
+		if (ret)
+			set_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
+		sqd->thread = tsk;
+		wake_up_new_task(tsk);
 		if (ret)
 			goto err;
 	} else if (p->flags & IORING_SETUP_SQ_AFF) {
-- 
2.30.1


  reply	other threads:[~2021-03-04 20:24 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-19 17:09 [PATCHSET RFC 0/18] Remove kthread usage from io_uring Jens Axboe
2021-02-19 17:09 ` [PATCH 01/18] io_uring: remove the need for relying on an io-wq fallback worker Jens Axboe
2021-02-19 20:25   ` Eric W. Biederman
2021-02-19 20:37     ` Jens Axboe
2021-02-22 13:46   ` Pavel Begunkov
2021-02-19 17:09 ` [PATCH 02/18] io-wq: don't create any IO workers upfront Jens Axboe
2021-02-19 17:09 ` [PATCH 03/18] io_uring: disable io-wq attaching Jens Axboe
2021-02-19 17:09 ` [PATCH 04/18] io-wq: get rid of wq->use_refs Jens Axboe
2021-02-19 17:09 ` [PATCH 05/18] io_uring: tie async worker side to the task context Jens Axboe
2021-02-20  8:11   ` Hao Xu
2021-02-20 14:38     ` Jens Axboe
2021-02-21  9:16       ` Hao Xu
2021-02-19 17:09 ` [PATCH 06/18] io-wq: don't pass 'wqe' needlessly around Jens Axboe
2021-02-19 17:09 ` [PATCH 07/18] arch: setup PF_IO_WORKER threads like PF_KTHREAD Jens Axboe
2021-02-19 22:21   ` Eric W. Biederman
2021-02-19 23:26     ` Jens Axboe
2021-02-19 17:10 ` [PATCH 08/18] kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals Jens Axboe
2021-02-19 17:10 ` [PATCH 09/18] io-wq: fork worker threads from original task Jens Axboe
2021-03-04 12:23   ` Stefan Metzmacher
2021-03-04 13:05     ` Jens Axboe
2021-03-04 13:19       ` Stefan Metzmacher
2021-03-04 16:13         ` Stefan Metzmacher
2021-03-04 16:42           ` Jens Axboe
2021-03-04 17:09             ` Stefan Metzmacher
2021-03-04 17:32               ` Jens Axboe
2021-03-04 18:19                 ` Jens Axboe
2021-03-04 18:56                   ` Linus Torvalds
2021-03-04 19:19                     ` Jens Axboe
2021-03-04 19:46                       ` Linus Torvalds
2021-03-04 19:54                         ` Jens Axboe
2021-03-04 20:00                           ` Jens Axboe
2021-03-04 20:23                             ` Jens Axboe [this message]
2021-03-04 20:50                           ` Linus Torvalds
2021-03-04 20:54                             ` Jens Axboe
2021-03-05 19:16           ` Eric W. Biederman
2021-03-05 19:00       ` Eric W. Biederman
2021-02-19 17:10 ` [PATCH 10/18] io-wq: worker idling always returns false Jens Axboe
2021-02-19 17:10 ` [PATCH 11/18] io_uring: remove any grabbing of context Jens Axboe
2021-02-19 17:10 ` [PATCH 12/18] io_uring: remove io_identity Jens Axboe
2021-02-19 17:10 ` [PATCH 13/18] io-wq: only remove worker from free_list, if it was there Jens Axboe
2021-02-19 17:10 ` [PATCH 14/18] io-wq: make io_wq_fork_thread() available to other users Jens Axboe
2021-02-19 17:10 ` [PATCH 15/18] io_uring: move SQPOLL thread io-wq forked worker Jens Axboe
2021-02-19 17:10 ` [PATCH 16/18] Revert "proc: don't allow async path resolution of /proc/thread-self components" Jens Axboe
2021-02-19 17:10 ` [PATCH 17/18] Revert "proc: don't allow async path resolution of /proc/self components" Jens Axboe
2021-02-19 17:10 ` [PATCH 18/18] net: remove cmsg restriction from io_uring based send/recvmsg calls Jens Axboe
2021-02-19 23:44 ` [PATCHSET RFC 0/18] Remove kthread usage from io_uring Stefan Metzmacher
2021-02-19 23:51   ` Jens Axboe
2021-02-21  5:04 ` Linus Torvalds
2021-02-21 21:22   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d0eb6092-cbc9-dd37-e49f-6ff8c6b8b136@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=ebiederm@xmission.com \
    --cc=io-uring@vger.kernel.org \
    --cc=metze@samba.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.