* [PATCH 0/3] vhost: Fix freezer/ps regressions
@ 2023-05-22 2:51 Mike Christie
2023-05-22 2:51 ` [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending Mike Christie
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Mike Christie @ 2023-05-22 2:51 UTC (permalink / raw)
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
The following patches, made over Linus's tree, fix these 2 bugs:
1. The vhost worker task shows up as a process forked from the parent
that did the VHOST_SET_OWNER ioctl instead of as a process under root/kthreadd.
This was breaking scripts.
2. vhost_tasks didn't disable freeze requests or add support for them.
The following patches fix these issues by making the vhost_task a thread
under the process that did the VHOST_SET_OWNER ioctl, and by using
get_signal() to handle freeze and SIGSTOP/SIGKILL signals, which is required
when using CLONE_THREAD (CLONE_THREAD requires CLONE_SIGHAND, which in
turn requires SIGKILL/SIGSTOP to be supported).
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply [flat|nested] 51+ messages in thread* [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending 2023-05-22 2:51 [PATCH 0/3] vhost: Fix freezer/ps regressions Mike Christie @ 2023-05-22 2:51 ` Mike Christie 2023-05-23 15:30 ` Eric W. Biederman 2023-05-22 2:51 ` [PATCH 2/3] signal: Don't exit for PF_USER_WORKER tasks Mike Christie 2023-05-22 2:51 ` [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression Mike Christie 2 siblings, 1 reply; 51+ messages in thread From: Mike Christie @ 2023-05-22 2:51 UTC (permalink / raw) To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds, linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha, brauner When get_pending detects the task has been marked to be killed we try to clean up the SIGKLL by doing a sigdelset and recalc_sigpending, but we still leave it in shared_pending. If the signal is being short circuit delivered there is no need to put in shared_pending so this adds a check in complete_signal. This patch was modified from Eric Biederman <ebiederm@xmission.com> original patch. Signed-off-by: Mike Christie <michael.christie@oracle.com> --- kernel/signal.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..3dc99b9aec7f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1052,6 +1052,14 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) signal->flags = SIGNAL_GROUP_EXIT; signal->group_exit_code = sig; signal->group_stop_count = 0; + + /* + * The signal is being short circuit delivered so + * don't set pending. 
+ */ + if (type != PIDTYPE_PID) + sigdelset(&signal->shared_pending.signal, sig); + t = p; do { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); -- 2.25.1 _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending 2023-05-22 2:51 ` [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending Mike Christie @ 2023-05-23 15:30 ` Eric W. Biederman 0 siblings, 0 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-23 15:30 UTC (permalink / raw) To: Mike Christie Cc: axboe, brauner, mst, linux-kernel, oleg, stefanha, linux, nicolas.dichtel, virtualization, torvalds Mike Christie <michael.christie@oracle.com> writes: > When get_pending detects the task has been marked to be killed we try to ^^^^^^^^^^^ get_signal > clean up the SIGKLL by doing a sigdelset and recalc_sigpending, but we > still leave it in shared_pending. If the signal is being short circuit > delivered there is no need to put in shared_pending so this adds a check > in complete_signal. > > This patch was modified from Eric Biederman <ebiederm@xmission.com> > original patch. > > Signed-off-by: Mike Christie <michael.christie@oracle.com> > --- > kernel/signal.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/kernel/signal.c b/kernel/signal.c > index 8f6330f0e9ca..3dc99b9aec7f 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -1052,6 +1052,14 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) > signal->flags = SIGNAL_GROUP_EXIT; > signal->group_exit_code = sig; > signal->group_stop_count = 0; > + > + /* > + * The signal is being short circuit delivered so > + * don't set pending. > + */ > + if (type != PIDTYPE_PID) > + sigdelset(&signal->shared_pending.signal, sig); > + > t = p; > do { > task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); Oleg Nesterov <oleg@redhat.com> writes: > > Eric, sorry. I fail to understand this patch. > > How can it help? And whom? You were looking at why recalc_sigpending was resulting in TIF_SIGPENDING set. The big bug was that get_signal was getting called by the thread after the thread had realized it was part of a group exit. 
The minor bug is that SIGKILL was stuck in shared_pending and causing recalc_sigpending to set TIF_SIGPENDING after get_signal removed the per thread flag that asks the thread to exit. The fact is that fatal signals (that pass all of the checks) are delivered right there in complete_signal so it does not make sense from a data structure consistency standpoint to leave the fatal signal (like SIGKILL) in shared_pending. Outside of this case it will only affect coredumps and other analyzers that run at process exit. One thing I am looking at is that the vhost code shares a common problem with the coredump code to pipes. There is code that tests signal_pending() and does something with it after signal processing has completed. Fixing the data structure to be consistent seems like one way to handle that situation. Eric _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 2/3] signal: Don't exit for PF_USER_WORKER tasks 2023-05-22 2:51 [PATCH 0/3] vhost: Fix freezer/ps regressions Mike Christie 2023-05-22 2:51 ` [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending Mike Christie @ 2023-05-22 2:51 ` Mike Christie 2023-05-22 2:51 ` [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression Mike Christie 2 siblings, 0 replies; 51+ messages in thread From: Mike Christie @ 2023-05-22 2:51 UTC (permalink / raw) To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds, linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha, brauner vhost_tasks also need to perform cleanup before exiting so this changes the check in get_signal to be more generic so both io thread and vhost tasks can do their cleanup. Signed-off-by: Mike Christie <michael.christie@oracle.com> --- kernel/signal.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 3dc99b9aec7f..8050fe23c732 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2869,11 +2869,11 @@ bool get_signal(struct ksignal *ksig) } /* - * PF_IO_WORKER threads will catch and exit on fatal signals + * PF_USER_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so * we cannot call do_exit() on their behalf. */ - if (current->flags & PF_IO_WORKER) + if (current->flags & PF_USER_WORKER) goto out; /* -- 2.25.1 _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 2:51 [PATCH 0/3] vhost: Fix freezer/ps regressions Mike Christie 2023-05-22 2:51 ` [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending Mike Christie 2023-05-22 2:51 ` [PATCH 2/3] signal: Don't exit for PF_USER_WORKER tasks Mike Christie @ 2023-05-22 2:51 ` Mike Christie 2023-05-22 12:30 ` Oleg Nesterov 2023-05-22 19:40 ` Michael S. Tsirkin 2 siblings, 2 replies; 51+ messages in thread From: Mike Christie @ 2023-05-22 2:51 UTC (permalink / raw) To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds, linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha, brauner When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks now show up as processes, so scripts doing ps or ps a would now incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks didn't disable it or add support for it. To fix both bugs, this switches the vhost task to be a thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/SIGSTOP support is required because CLONE_THREAD requires CLONE_SIGHAND, which requires those 2 signals to be supported. This is a modified version of a patch originally written by Linus, which handles his review comment to himself to rename ignore_signals to block_signals to better represent what it now does. It also includes a change to vhost_worker() to support SIGSTOP/SIGKILL and freeze, and it drops the wait use per Oleg's review comment that it's no longer needed when using CLONE_THREAD. 
Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie <michael.christie@oracle.com> --- drivers/vhost/vhost.c | 17 ++++++++++++++++- include/linux/sched/task.h | 2 +- kernel/fork.c | 12 +++--------- kernel/signal.c | 1 + kernel/vhost_task.c | 16 ++++------------ 5 files changed, 25 insertions(+), 23 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..bf83e9340e40 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -338,6 +338,7 @@ static int vhost_worker(void *data) struct vhost_worker *worker = data; struct vhost_work *work, *work_next; struct llist_node *node; + bool dead = false; for (;;) { /* mb paired w/ kthread_stop */ @@ -349,8 +350,22 @@ static int vhost_worker(void *data) } node = llist_del_all(&worker->work_list); - if (!node) + if (!node) { schedule(); + /* + * When we get a SIGKILL our release function will + * be called. That will stop new IOs from being queued + * and check for outstanding cmd responses. It will then + * call vhost_task_stop to tell us to return and exit. 
+ */ + if (!dead && signal_pending(current)) { + struct ksignal ksig; + + dead = get_signal(&ksig); + if (dead) + clear_thread_flag(TIF_SIGPENDING); + } + } node = llist_reverse_order(node); /* make sure flag is seen after deletion */ diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..249a5ece9def 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,7 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; + u32 block_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..9e04ab5c3946 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process( p->flags |= PF_KTHREAD; if (args->user_worker) p->flags |= PF_USER_WORKER; - if (args->io_thread) { - /* - * Mark us an IO worker, and block any signal that isn't - * fatal or STOP - */ + if (args->io_thread) p->flags |= PF_IO_WORKER; + if (args->block_signals) siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); - } if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm)); @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); - stackleak_task_init(p); if (pid != &init_struct_pid) { @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) .fn_arg = arg, .io_thread = 1, .user_worker = 1, + .block_signals = 1, }; return copy_process(NULL, 0, node, &args); diff --git a/kernel/signal.c b/kernel/signal.c index 8050fe23c732..a0f00a078cbb 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig) return ksig->sig > 0; } +EXPORT_SYMBOL_GPL(get_signal); /** * signal_delivered - called after signal delivery to update blocked signals diff --git 
a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..7a2d7d9fe772 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -31,22 +31,13 @@ static int vhost_task_fn(void *data) */ void vhost_task_stop(struct vhost_task *vtsk) { - pid_t pid = vtsk->task->pid; - set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags); wake_up_process(vtsk->task); /* * Make sure vhost_task_fn is no longer accessing the vhost_task before - * freeing it below. If userspace crashed or exited without closing, - * then the vhost_task->task could already be marked dead so - * kernel_wait will return early. + * freeing it below. */ wait_for_completion(&vtsk->exited); - /* - * If we are just closing/removing a device and the parent process is - * not exiting then reap the task. - */ - kernel_wait4(pid, NULL, __WCLONE, NULL); kfree(vtsk); } EXPORT_SYMBOL_GPL(vhost_task_stop); @@ -75,13 +66,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | + CLONE_THREAD | CLONE_SIGHAND, .exit_signal = 0, .fn = vhost_task_fn, .name = name, .user_worker = 1, .no_files = 1, - .ignore_signals = 1, + .block_signals = 1, }; struct vhost_task *vtsk; struct task_struct *tsk; -- 2.25.1 _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 2:51 ` [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression Mike Christie @ 2023-05-22 12:30 ` Oleg Nesterov 2023-05-22 17:00 ` Mike Christie 2023-05-22 19:40 ` Michael S. Tsirkin 1 sibling, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-22 12:30 UTC (permalink / raw) To: Mike Christie Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds Confused, please help... On 05/21, Mike Christie wrote: > > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -338,6 +338,7 @@ static int vhost_worker(void *data) > struct vhost_worker *worker = data; > struct vhost_work *work, *work_next; > struct llist_node *node; > + bool dead = false; > > for (;;) { > /* mb paired w/ kthread_stop */ > @@ -349,8 +350,22 @@ static int vhost_worker(void *data) > } > > node = llist_del_all(&worker->work_list); > - if (!node) > + if (!node) { > schedule(); > + /* > + * When we get a SIGKILL our release function will > + * be called. That will stop new IOs from being queued > + * and check for outstanding cmd responses. It will then > + * call vhost_task_stop to tell us to return and exit. > + */ But who will call the release function / vhost_task_stop() and when this will happen after this thread gets SIGKILL ? > + if (!dead && signal_pending(current)) { > + struct ksignal ksig; > + > + dead = get_signal(&ksig); > + if (dead) > + clear_thread_flag(TIF_SIGPENDING); If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ? Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING. SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again. In this case the main for (;;) loop will spin without sleeping until vhost_task_should_stop() becomes true? Oleg. 
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 12:30 ` Oleg Nesterov @ 2023-05-22 17:00 ` Mike Christie 2023-05-22 17:47 ` Oleg Nesterov 0 siblings, 1 reply; 51+ messages in thread From: Mike Christie @ 2023-05-22 17:00 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 5/22/23 7:30 AM, Oleg Nesterov wrote: > Confused, please help... > > On 05/21, Mike Christie wrote: >> >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -338,6 +338,7 @@ static int vhost_worker(void *data) >> struct vhost_worker *worker = data; >> struct vhost_work *work, *work_next; >> struct llist_node *node; >> + bool dead = false; >> >> for (;;) { >> /* mb paired w/ kthread_stop */ >> @@ -349,8 +350,22 @@ static int vhost_worker(void *data) >> } >> >> node = llist_del_all(&worker->work_list); >> - if (!node) >> + if (!node) { >> schedule(); >> + /* >> + * When we get a SIGKILL our release function will >> + * be called. That will stop new IOs from being queued >> + * and check for outstanding cmd responses. It will then >> + * call vhost_task_stop to tell us to return and exit. >> + */ > > But who will call the release function / vhost_task_stop() and when this > will happen after this thread gets SIGKILL ? When we get a SIGKILL, the thread that owns the device/vhost_task will also exit since it's the same thread group and it does: do_exit -> exit_files -> put_files_struct -> close_files -> fput which eventually (the actual put is done via the delayed work path in there) runs the file_operation->release. So the release function is being called in parallel more or less as the code above. > >> + if (!dead && signal_pending(current)) { >> + struct ksignal ksig; >> + >> + dead = get_signal(&ksig); >> + if (dead) >> + clear_thread_flag(TIF_SIGPENDING); > > If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ? You're right. 
I don't need it. I thought __fatal_signal_pending checked the shared pending as well but it only checks pending. > Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING. > > SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again. > In this case the main for (;;) loop will spin without sleeping until > vhost_task_should_stop() becomes true? I see. So I either have to be able to call get_signal after SIGKILL or at this time work like a kthread and ignore signals like a if (dead && signal_pending()) flush_signals() ? _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 17:00 ` Mike Christie @ 2023-05-22 17:47 ` Oleg Nesterov 2023-05-23 12:15 ` Oleg Nesterov 0 siblings, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-22 17:47 UTC (permalink / raw) To: Mike Christie Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 05/22, Mike Christie wrote: > > On 5/22/23 7:30 AM, Oleg Nesterov wrote: > >> + /* > >> + * When we get a SIGKILL our release function will > >> + * be called. That will stop new IOs from being queued > >> + * and check for outstanding cmd responses. It will then > >> + * call vhost_task_stop to tell us to return and exit. > >> + */ > > > > But who will call the release function / vhost_task_stop() and when this > > will happen after this thread gets SIGKILL ? > > When we get a SIGKILL, the thread that owns the device/vhost_task will > also exit since it's the same thread group and it does: > > do_exit -> exit_files -> put_files_struct -> close_files -> fput Ah. thanks. I confused CLONE_FS in vhost_task_create() with CLONE_FILES. > > Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING. > > > > SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again. > > In this case the main for (;;) loop will spin without sleeping until > > vhost_task_should_stop() becomes true? > > I see. So I either have to be able to call get_signal after SIGKILL or > at this time work like a kthread and ignore signals like a > > if (dead && signal_pending()) > flush_signals() > ? Right now I think that "int dead" should die, and you should simply do get_signal() + clear(SIGPENDING) if signal_pending() == T , but let me think tomorrow. Oleg. 
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 17:47 ` Oleg Nesterov @ 2023-05-23 12:15 ` Oleg Nesterov 2023-05-23 15:57 ` Eric W. Biederman ` (3 more replies) 0 siblings, 4 replies; 51+ messages in thread From: Oleg Nesterov @ 2023-05-23 12:15 UTC (permalink / raw) To: Mike Christie Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 05/22, Oleg Nesterov wrote: > > Right now I think that "int dead" should die, No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL. > but let me think tomorrow. May be something like this... I don't like it but I can't suggest anything better right now. bool killed = false; for (;;) { ... node = llist_del_all(&worker->work_list); if (!node) { schedule(); /* * When we get a SIGKILL our release function will * be called. That will stop new IOs from being queued * and check for outstanding cmd responses. It will then * call vhost_task_stop to tell us to return and exit. */ if (signal_pending(current)) { struct ksignal ksig; if (!killed) killed = get_signal(&ksig); clear_thread_flag(TIF_SIGPENDING); } continue; } ------------------------------------------------------------------------------- But let me ask a couple of questions. Let's forget this patch, let's look at the current code: node = llist_del_all(&worker->work_list); if (!node) schedule(); node = llist_reverse_order(node); ... process works ... To me this looks a bit confusing. Shouldn't we do if (!node) { schedule(); continue; } just to make the code a bit more clear? If node == NULL then llist_reverse_order() and llist_for_each_entry_safe() will do nothing. But this is minor. /* make sure flag is seen after deletion */ smp_wmb(); llist_for_each_entry_safe(work, work_next, node, node) { clear_bit(VHOST_WORK_QUEUED, &work->flags); I am not sure about smp_wmb + clear_bit. 
Once we clear VHOST_WORK_QUEUED, vhost_work_queue() can add this work again and change work->node->next. That is why we use _safe, but we need to ensure that llist_for_each_safe() completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. So it seems that smp_wmb() can't help and should be removed, instead we need llist_for_each_entry_safe(...) { smp_mb__before_atomic(); clear_bit(VHOST_WORK_QUEUED, &work->flags); Also, if the work->fn pointer is not stable, we should read it before smp_mb__before_atomic() as well. No? __set_current_state(TASK_RUNNING); Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() can return with current->state != RUNNING ? work->fn(work); Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right before we call work->fn(). Is it "safe" to run this callback with signal_pending() or fatal_signal_pending() ? Finally. I never looked into drivers/vhost/ before so I don't understand this code at all, but let me ask anyway... Can we change vhost_dev_flush() to run the pending callbacks rather than wait for vhost_worker() ? I guess we can't, ->mm won't be correct, but can you confirm? Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-23 12:15 ` Oleg Nesterov @ 2023-05-23 15:57 ` Eric W. Biederman 2023-05-24 14:10 ` Oleg Nesterov 2023-05-31 5:22 ` Jason Wang 2023-05-24 0:02 ` Mike Christie ` (2 subsequent siblings) 3 siblings, 2 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-23 15:57 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, torvalds Oleg Nesterov <oleg@redhat.com> writes: > On 05/22, Oleg Nesterov wrote: >> >> Right now I think that "int dead" should die, > > No, probably we shouldn't call get_signal() if we have already > dequeued SIGKILL. Very much agreed. It is one thing to add a patch to move do_exit out of get_signal. It is another to keep calling get_signal after that. Nothing tests that case, and so we get some weird behaviors. >> but let me think tomorrow. > > May be something like this... I don't like it but I can't suggest anything better > right now. > > bool killed = false; > > for (;;) { > ... > > node = llist_del_all(&worker->work_list); > if (!node) { > schedule(); > /* > * When we get a SIGKILL our release function will > * be called. That will stop new IOs from being queued > * and check for outstanding cmd responses. It will then > * call vhost_task_stop to tell us to return and exit. > */ > if (signal_pending(current)) { > struct ksignal ksig; > > if (!killed) > killed = get_signal(&ksig); > > clear_thread_flag(TIF_SIGPENDING); > } > > continue; > } I want to point out that we need to consider not just SIGKILL, but SIGABRT that causes a coredump, as well as the process peforming an ordinary exit(2). All of which will cause get_signal to return SIGKILL in this context. > > ------------------------------------------------------------------------------- > But let me ask a couple of questions. I share most of these questions. 
> Let's forget this patch, let's look at the > current code: > > node = llist_del_all(&worker->work_list); > if (!node) > schedule(); > > node = llist_reverse_order(node); > ... process works ... > > To me this looks a bit confusing. Shouldn't we do > > if (!node) { > schedule(); > continue; > } > > just to make the code a bit more clear? If node == NULL then > llist_reverse_order() and llist_for_each_entry_safe() will do nothing. > But this is minor. > > > > /* make sure flag is seen after deletion */ > smp_wmb(); > llist_for_each_entry_safe(work, work_next, node, node) { > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > vhost_work_queue() can add this work again and change work->node->next. > > That is why we use _safe, but we need to ensure that llist_for_each_safe() > completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > So it seems that smp_wmb() can't help and should be removed, instead we need > > llist_for_each_entry_safe(...) { > smp_mb__before_atomic(); > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > Also, if the work->fn pointer is not stable, we should read it before > smp_mb__before_atomic() as well. > > No? > > > __set_current_state(TASK_RUNNING); > > Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() > can return with current->state != RUNNING ? > > > work->fn(work); > > Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > before we call work->fn(). Is it "safe" to run this callback with > signal_pending() or fatal_signal_pending() ? > > > Finally. I never looked into drivers/vhost/ before so I don't understand > this code at all, but let me ask anyway... Can we change vhost_dev_flush() > to run the pending callbacks rather than wait for vhost_worker() ? > I guess we can't, ->mm won't be correct, but can you confirm? 
In a conversation long ago I remember hearing that vhost does not support file descriptor passing. Which means all of the file descriptors should be in the same process. Looking at the vhost code what I am seeing happening is that the vhost_worker persists until vhost_dev_cleanup is called from one of the vhost_???_release() functions. The release functions are only called after the last flush function completes. See __fput if you want to trace the details. On one hand this all seems reasonable. On the other hand I am not seeing the code that prevents file descriptor passing. It is probably not the worst thing in the world, but what this means is now if you pass a copy of the vhost file descriptor to another process the vhost_worker will persist, and thus the process will persist until that copy of the file descriptor is closed. Eric _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-23 15:57 ` Eric W. Biederman @ 2023-05-24 14:10 ` Oleg Nesterov 2023-05-24 14:44 ` Eric W. Biederman 2023-05-31 5:22 ` Jason Wang 1 sibling, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-24 14:10 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, torvalds On 05/23, Eric W. Biederman wrote: > > I want to point out that we need to consider not just SIGKILL, but > SIGABRT that causes a coredump, as well as the process peforming > an ordinary exit(2). All of which will cause get_signal to return > SIGKILL in this context. Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt vhost_worker(). > It is probably not the worst thing in the world, but what this means > is now if you pass a copy of the vhost file descriptor to another > process the vhost_worker will persist, and thus the process will persist > until that copy of the file descriptor is closed. Hadn't thought about it. I am fighting with internal bugzillas today, will try to write another email tomorrow. But before that, I would like to have an answer to my "main" question in my previois email. Otherwise I am still not sure I understand what exactly we need to fix. Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-24 14:10 ` Oleg Nesterov @ 2023-05-24 14:44 ` Eric W. Biederman 2023-05-25 11:55 ` Oleg Nesterov 0 siblings, 1 reply; 51+ messages in thread From: Eric W. Biederman @ 2023-05-24 14:44 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, torvalds Oleg Nesterov <oleg@redhat.com> writes: > On 05/23, Eric W. Biederman wrote: >> >> I want to point out that we need to consider not just SIGKILL, but >> SIGABRT that causes a coredump, as well as the process peforming >> an ordinary exit(2). All of which will cause get_signal to return >> SIGKILL in this context. > > Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt > vhost_worker(). Actually I think it reveals that exiting with SIGABRT will cause a deadlock. coredump_wait will wait for all of the threads to reach coredump_task_exit. Meanwhile vhost_worker is waiting for all of the other threads to reach exit_files to close their file descriptors. So it looks like the final pieces of work will actually need to be moved into to vhost_xxx_flush or vhost_xxx_release to avoid the exiting threads from waiting on each other, instead of depending upon the vhost_worker to do the work. Which gets back to most of your other questions. >> It is probably not the worst thing in the world, but what this means >> is now if you pass a copy of the vhost file descriptor to another >> process the vhost_worker will persist, and thus the process will persist >> until that copy of the file descriptor is closed. > > Hadn't thought about it. > > I am fighting with internal bugzillas today, will try to write another > email tomorrow. > > But before that, I would like to have an answer to my "main" question in > my previois email. Otherwise I am still not sure I understand what exactly > we need to fix. Let me repeat your "main" question just for clarity here. 
If a signal comes in after the signal_pending check but before the
"work->fn(work)" call, is "work->fn(work)" expected to run correctly
with signal_pending() or fatal_signal_pending() returning true?

Eric

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-24 14:44 ` Eric W. Biederman @ 2023-05-25 11:55 ` Oleg Nesterov 2023-05-25 15:30 ` Eric W. Biederman 0 siblings, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-25 11:55 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, torvalds On 05/24, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt > > vhost_worker(). > > Actually I think it reveals that exiting with SIGABRT will cause > a deadlock. > > coredump_wait will wait for all of the threads to reach > coredump_task_exit. Meanwhile vhost_worker is waiting for > all of the other threads to reach exit_files to close their > file descriptors. Indeed, I didn't think about this. So why do we actually need CLONE_THREAD ? Can't vhost_worker() be a kernel thread? kthread_create() won't be convenient, but how about kernel_thread() ? it inherits mm/cgroups/rlimits/etc, kthread_stop() should work just fine. Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-25 11:55 ` Oleg Nesterov @ 2023-05-25 15:30 ` Eric W. Biederman 2023-05-25 16:20 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Eric W. Biederman @ 2023-05-25 15:30 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, torvalds Oleg Nesterov <oleg@redhat.com> writes: > On 05/24, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt >> > vhost_worker(). >> >> Actually I think it reveals that exiting with SIGABRT will cause >> a deadlock. >> >> coredump_wait will wait for all of the threads to reach >> coredump_task_exit. Meanwhile vhost_worker is waiting for >> all of the other threads to reach exit_files to close their >> file descriptors. > > Indeed, I didn't think about this. > > > So why do we actually need CLONE_THREAD ? Can't vhost_worker() be a kernel thread? > > kthread_create() won't be convenient, but how about kernel_thread() ? it inherits > mm/cgroups/rlimits/etc, kthread_stop() should work just fine. Basically with no patches that is where Linus's kernel is. User complained about the new thread showing up in ps. So the exploration of how we could use CLONE_THREAD instead of what is currently merged began. Eric _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-25 15:30 ` Eric W. Biederman
@ 2023-05-25 16:20 ` Linus Torvalds
  2023-05-27  9:49 ` Eric W. Biederman
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2023-05-25 16:20 UTC (permalink / raw)
To: Eric W. Biederman
Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha,
	linux, nicolas.dichtel, virtualization

On Thu, May 25, 2023 at 8:30 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Basically with no patches that is where Linus's kernel is.
>
> User complained about the new thread showing up in ps.

Well, not only that, but it actively broke existing workflows for
managing things. Showing up in 'ps' wasn't just some purely cosmetic
issue, but had semantic meaning.

And honestly, I think the core issue is that we should just make this
work. Kernel threads randomly switching to user memory threads was
wrong, so CLONE_VM is absolutely the right thing to do.

But while "CLONE_VM without real threading" is a very traditional
thing in Linux - it was the original model for clone(), after all - I
don't believe it is the *correct* model. There was a very real reason
clone() has grown CLONE_THREAD and friends.

So honestly, I really think we want to complete the vhost move to
CLONE_THREAD (and thus CLONE_SIGNAL). Not because the old kthread
model didn't _work_, but because it's really really wrong to try to
randomly take on user-space attributes at run-time.

And once you do the "user threads in kernel space" model, at that
point you really do want to act like a proper thread. Both because of
that 'ps' issue (which is really just "show the world what your
relationship is"), but simply because that is the modern threading
model that we use for everything else, and special cases are bad.

So I'd really like to finish this. Even if we end up with a hack or
two in signal handling that we can hopefully fix up later by having
vhost fix up some of its current assumptions.
It has worked wonderfully well for io_uring - but we *did* have quite
a few conversion patches over some time as people found issues.

Which is why I don't expect the vhost conversion to be entirely
pain-free either, and I don't think we necessarily have to get to a
"perfect and clean" state immediately, just a "working and
conceptually in the right direction" state.

Linus

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-25 16:20 ` Linus Torvalds @ 2023-05-27 9:49 ` Eric W. Biederman 2023-05-27 16:12 ` Linus Torvalds 2023-05-30 15:01 ` Eric W. Biederman 0 siblings, 2 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-27 9:49 UTC (permalink / raw) To: Linus Torvalds Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization Linus Torvalds <torvalds@linux-foundation.org> writes: > So I'd really like to finish this. Even if we end up with a hack or > two in signal handling that we can hopefully fix up later by having > vhost fix up some of its current assumptions. The real sticky widget for me is how to handle one of these processes coredumping. It really looks like it will result in a reliable hang. Limiting ourselves to changes that will only affect vhost, all I can see would be allowing the vhost_worker thread to exit as soon as get_signal reports the process is exiting. Then vhost_dev_flush would need to process the pending work. 
Something like this (note: the sketch originally used `work` and
`work_next` without declaring them in vhost_dev_flush, fixed here):

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..fb5ebc50c553 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -234,14 +234,31 @@ EXPORT_SYMBOL_GPL(vhost_poll_stop);
 void vhost_dev_flush(struct vhost_dev *dev)
 {
 	struct vhost_flush_struct flush;
+	struct vhost_worker *worker = dev->worker;
+	struct vhost_work *work, *work_next;
+	struct llist_node *node;
+
+	if (!worker)
+		return;
+
+	init_completion(&flush.wait_event);
+	vhost_work_init(&flush.work, vhost_flush_work);
 
-	if (dev->worker) {
-		init_completion(&flush.wait_event);
-		vhost_work_init(&flush.work, vhost_flush_work);
+	vhost_work_queue(dev, &flush.work);
 
-		vhost_work_queue(dev, &flush.work);
-		wait_for_completion(&flush.wait_event);
+	/* Either vhost_worker runs the pending work or we do */
+	node = llist_del_all(&worker->work_list);
+	if (node) {
+		node = llist_reverse_order(node);
+		/* make sure flag is seen after deletion */
+		smp_wmb();
+		llist_for_each_entry_safe(work, work_next, node, node) {
+			clear_bit(VHOST_WORK_QUEUED, &work->flags);
+			work->fn(work);
+			cond_resched();
+		}
 	}
+
+	wait_for_completion(&flush.wait_event);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
@@ -338,6 +355,7 @@ static int vhost_worker(void *data)
 	struct vhost_worker *worker = data;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
+	struct ksignal ksig;
 
 	for (;;) {
 		/* mb paired w/ kthread_stop */
@@ -348,6 +366,9 @@ static int vhost_worker(void *data)
 			break;
 		}
 
+		if (get_signal(&ksig))
+			break;
+
 		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..613d52f01c07 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -47,6 +47,7 @@ void vhost_task_stop(struct vhost_task *vtsk)
 	 * not exiting then reap the task.
	 */
	kernel_wait4(pid, NULL, __WCLONE, NULL);
+	put_task_struct(vtsk->task);
	kfree(vtsk);
 }
 EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -101,7 +102,7 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 		return NULL;
 	}
 
-	vtsk->task = tsk;
+	vtsk->task = get_task_struct(tsk);
 	return vtsk;
 }
 EXPORT_SYMBOL_GPL(vhost_task_create);

Eric

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-27  9:49 ` Eric W. Biederman
@ 2023-05-27 16:12 ` Linus Torvalds
  2023-05-28  1:17 ` Eric W. Biederman
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2023-05-27 16:12 UTC (permalink / raw)
To: Eric W. Biederman
Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha,
	linux, nicolas.dichtel, virtualization

[-- Attachment #1: Type: text/plain, Size: 1482 bytes --]

On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The real sticky widget for me is how to handle one of these processes
> coredumping. It really looks like it will result in a reliable hang.

Well, if *that* is the main worry, I think that's trivial enough to deal with.

In particular, we could make the rule just be that user worker threads
simply do not participate in core-dumps.

THAT isn't hard.

All we need to do is

 (a) not count those threads in zap_threads()

 (b) make sure that they don't add themselves to the "dumper" list by
not calling "coredump_task_exit()"

 (c) not initiate core-dumping themselves.

and I think that's pretty much it.

In fact, that really seems like a good model *regardless*, because
honestly, a PF_IO_WORKER doesn't have valid register state for the
core dump anyway, and anything that would have caused an IO thread to
get a SIGSEGV *should* have caused a kernel oops already.

So the only worry is that the core dump will now happen while an IO
worker is still busy and so it's not "atomic" wrt possible VM changes,
but while that used to be a big problem back in the dark ages when we
didn't get the VM locks for core dumping, that got fixed a few years
ago because it already caused lots of potential issues.

End result: I think the attached patch is probably missing something,
but the approach "FeelsRight(tm)" to me.

Comments?
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2133 bytes --]

 fs/coredump.c   |  2 +-
 kernel/exit.c   |  6 ++++++
 kernel/signal.c | 18 ++++++++++--------
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..46f8145b39e6 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -368,7 +368,7 @@ static int zap_process(struct task_struct *start, int exit_code)
 
 	for_each_thread(start, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
+		if (t != current && !(t->flags & (PF_POSTCOREDUMP | PF_IO_WORKER))) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
 			nr++;
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..fde57b9f4494 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -400,6 +400,12 @@ static void coredump_task_exit(struct task_struct *tsk)
 {
 	struct core_state *core_state;
 
+	/*
+	 * IO workers do not participate in dumping core
+	 */
+	if (tsk->flags & PF_IO_WORKER)
+		return;
+
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold siglock around checking core_state
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..e0acb11d3a1d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2845,6 +2845,16 @@ bool get_signal(struct ksignal *ksig)
 	 */
 	current->flags |= PF_SIGNALED;
 
+	/*
+	 * PF_IO_WORKER threads will catch and exit on fatal signals
+	 * themselves and do not participate in core dumping.
+	 *
+	 * They have cleanup that must be performed, so we cannot
+	 * call do_exit() on their behalf.
+	 */
+	if (current->flags & PF_IO_WORKER)
+		goto out;
+
 	if (sig_kernel_coredump(signr)) {
 		if (print_fatal_signals)
 			print_fatal_signal(ksig->info.si_signo);
@@ -2860,14 +2870,6 @@ bool get_signal(struct ksignal *ksig)
 		do_coredump(&ksig->info);
 	}
 
-	/*
-	 * PF_IO_WORKER threads will catch and exit on fatal signals
-	 * themselves.
They have cleanup that must be performed, so
-	 * we cannot call do_exit() on their behalf.
-	 */
-
 	/*
 	 * Death signals, no core dump.
 	 */

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-27 16:12 ` Linus Torvalds @ 2023-05-28 1:17 ` Eric W. Biederman 2023-05-28 1:21 ` Linus Torvalds 2023-05-29 11:19 ` Oleg Nesterov 0 siblings, 2 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-28 1:17 UTC (permalink / raw) To: Linus Torvalds Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> The real sticky widget for me is how to handle one of these processes >> coredumping. It really looks like it will result in a reliable hang. > > Well, if *that* is the main worry, I think that's trivial enough to deal with. > > In particular, we could make the rule just be that user worker threads > simply do not participate in core-dumps. > > THAT isn't hard. > > All we need to do is > > (a) not count those threads in zap_threads() > > (b) make sure that they don't add themselves to the "dumper" list by > not calling "coredujmp_task_exit()" > > (c) not initiate core-dumping themselves. > > and I think that's pretty much it. > > In fact, that really seems like a good model *regardless*, because > honestly, a PF_IO_WORKER doesn't have valid register state for the > core dump anyway, and anything that would have caused a IO thread to > get a SIGSEGV *should* have caused a kernel oops already. > > So the only worry is that the core dump will now happen while an IO > worker is still busy and so it's not "atomic" wrt possible VM changes, > but while that used to be a big problem back in the dark ages when we > didn't get the VM locks for core dumping, that got fixed a few years > ago because it already caused lots of potential issues. > > End result: I think the attached patch is probably missing something, > but the approach "FeelsRight(tm)" to me. > > Comments? 
It seems like a good approach for including in the -rc series. I think the change should look more like my change below. nits: - The threads all need to participate in the group exit even if they aren't going to be in the coredump. - For vhost_worker we need s/PF_IO_WORKER/PF_USER_WORKER/. - Moving PF_IO_WORKER above the sig_kernel_coredump(signr) test is unnecessary. The sig_kernel_coredump(signr) test can only become true if a process exit has not been initiated yet. More importantly moving the test obscures the fact that only do_group_exit is moved out of get_signal for the PF_IO_WORKER special case. Long term I think we want an approach that stops the worker threads during the coredumps. It will just yield a better quality of implementation if we minimize the amount of concurrency during the coredump. I have a pending patchset that moves the coredump rendezvous into get_signal. At which point stopping all of the threads is just like SIGSTOP something the worker threads can use and it won't introduce any issues. Today the problem is some of the worker thread code must run before the coredump stop. Looking forward I don't see not asking the worker threads to stop for the coredump right now causing any problems in the future. So I think we can use this to resolve the coredump issue I spotted. 
diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..620f7f9dc894 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -371,7 +371,8 @@ static int zap_process(struct task_struct *start, int exit_code)
 		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
-			nr++;
+			if (!(t->flags & PF_IO_WORKER))
+				nr++;
 		}
 	}
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..6082dba9131a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -411,7 +411,9 @@ static void coredump_task_exit(struct task_struct *tsk)
 	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = tsk->signal->core_state;
 	spin_unlock_irq(&tsk->sighand->siglock);
-	if (core_state) {
+
+	/* I/O Workers don't participate in coredumps */
+	if (core_state && !(tsk->flags & PF_IO_WORKER)) {
 		struct core_thread self;
 
 		self.task = current;

> 	current->flags |= PF_SIGNALED;
> 
> +	/*
> +	 * PF_IO_WORKER threads will catch and exit on fatal signals
> +	 * themselves and do not participate in core dumping.
> +	 *
> +	 * They have cleanup that must be performed, so we cannot
> +	 * call do_exit() on their behalf.
> +	 */
> +	if (current->flags & PF_IO_WORKER)
> +		goto out;
> +
> 	if (sig_kernel_coredump(signr)) {
> 		if (print_fatal_signals)
> 			print_fatal_signal(ksig->info.si_signo);
> @@ -2860,14 +2870,6 @@ bool get_signal(struct ksignal *ksig)
> 		do_coredump(&ksig->info);
> 	}
> 
> -	/*
> -	 * PF_IO_WORKER threads will catch and exit on fatal signals
> -	 * themselves. They have cleanup that must be performed, so
> -	 * we cannot call do_exit() on their behalf.
> -	 */
> -	if (current->flags & PF_IO_WORKER)
> -		goto out;
> -
> 	/*
> 	 * Death signals, no core dump.
> 	 */

Eric

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-28 1:17 ` Eric W. Biederman @ 2023-05-28 1:21 ` Linus Torvalds 2023-05-29 11:19 ` Oleg Nesterov 1 sibling, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2023-05-28 1:21 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization On Sat, May 27, 2023 at 6:17 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > It seems like a good approach for including in the -rc series. > I think the change should look more like my change below. I have no objections. My patch was a fairly "hack and slash" thing to just disassociate the IO workers entirely from the core dumping. Yours seems to be slightly more surgical. Linus _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-28 1:17 ` Eric W. Biederman 2023-05-28 1:21 ` Linus Torvalds @ 2023-05-29 11:19 ` Oleg Nesterov 2023-05-29 16:09 ` michael.christie ` (2 more replies) 1 sibling, 3 replies; 51+ messages in thread From: Oleg Nesterov @ 2023-05-29 11:19 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, Linus Torvalds On 05/27, Eric W. Biederman wrote: > > Looking forward I don't see not asking the worker threads to stop > for the coredump right now causing any problems in the future. > So I think we can use this to resolve the coredump issue I spotted. But we have almost the same problem with exec. Execing thread will wait for vhost_worker() while vhost_worker will wait for .release -> vhost_task_stop(). And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread(). Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES... If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal() returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and flush the pending works on ->work_list before exit, I dunno. But imo it should not wait for the final fput(). Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-29 11:19 ` Oleg Nesterov @ 2023-05-29 16:09 ` michael.christie 2023-05-29 17:46 ` Oleg Nesterov 2023-05-30 2:38 ` Eric W. Biederman 2023-05-29 16:11 ` michael.christie [not found] ` <20230530-autor-faxnummer-01e0a31c0fb8@brauner> 2 siblings, 2 replies; 51+ messages in thread From: michael.christie @ 2023-05-29 16:09 UTC (permalink / raw) To: Oleg Nesterov, Eric W. Biederman Cc: axboe, brauner, mst, linux-kernel, linux, virtualization, stefanha, nicolas.dichtel, Linus Torvalds On 5/29/23 6:19 AM, Oleg Nesterov wrote: > On 05/27, Eric W. Biederman wrote: >> >> Looking forward I don't see not asking the worker threads to stop >> for the coredump right now causing any problems in the future. >> So I think we can use this to resolve the coredump issue I spotted. > > But we have almost the same problem with exec. > > Execing thread will wait for vhost_worker() while vhost_worker will wait for > .release -> vhost_task_stop(). For this type of case, what is the goal or correct behavior in the end? When get_signal returns true we can code things like you mention below and clean up the task_struct. However, we now have a non-functioning vhost device open and just sitting around taking up memory and it can't do any IO. For this type of case, do we expect just not to crash/hang, or was this new exec'd thread suppose to be able to use the vhost device? I would normally say it probably wants to use the vhost device still. However, I don't think this comes up so just not hanging might be ok. Before 6.4-rc1, we ignored signals so it would have worked if we are concerned about a possible regression if this was a common thing. > > And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread(). > > Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES... You mean the vhost_task's task/thread doing a function that does a copy_process right? 
That type of thing is not needed. I can add a check in vhost_task_create for this so new code doesn't try to do it. I don't think it will come up that some code vhost is using will call kernel_thread/copy_process directly since those calls are so rare and the functions are not exported to modules. > > If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal() > returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and > flush the pending works on ->work_list before exit, I dunno. But imo it should > not wait for the final fput(). > > Oleg. > _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-29 16:09 ` michael.christie
@ 2023-05-29 17:46 ` Oleg Nesterov
  2023-05-29 17:54 ` Oleg Nesterov
  2023-05-29 19:35 ` Mike Christie
  1 sibling, 2 replies; 51+ messages in thread
From: Oleg Nesterov @ 2023-05-29 17:46 UTC (permalink / raw)
To: michael.christie
Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman,
	stefanha, nicolas.dichtel, virtualization, Linus Torvalds

Mike, sorry, I don't understand your email.

Just in case, let me remind I know nothing about drivers/vhost/

On 05/29, michael.christie@oracle.com wrote:
>
> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> > On 05/27, Eric W. Biederman wrote:
> >>
> >> Looking forward I don't see not asking the worker threads to stop
> >> for the coredump right now causing any problems in the future.
> >> So I think we can use this to resolve the coredump issue I spotted.
> >
> > But we have almost the same problem with exec.
> >
> > Execing thread will wait for vhost_worker() while vhost_worker will wait for
> > .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below

and you have mentioned in the next email that you have already coded something
like this, so perhaps we can delay the further discussions until you send the
new code?

> and
> clean up the task_struct.

Hmm... If we use CLONE_THREAD, the exiting vhost_worker() will auto-reap itself,

> However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.

can't comment, see above.

> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

I just tried to point out that (unless I missed something) there are more corner
cases, not just coredump.
> > Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES... > > You mean the vhost_task's task/thread doing a function that does a copy_process > right? I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from userspace. Yes, this implies copy_process() but I still can't understand you. > That type of thing is not needed. Do you mean that userspace should never do this? But this doesn't matter, the kernel should handle this case anyway. Or what? In short let me repeat that I don't understand you and - of course! - quite possibly I missed something. Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-29 17:46 ` Oleg Nesterov @ 2023-05-29 17:54 ` Oleg Nesterov 2023-05-29 19:03 ` Mike Christie 2023-05-29 19:35 ` Mike Christie 1 sibling, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-29 17:54 UTC (permalink / raw) To: michael.christie Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman, stefanha, nicolas.dichtel, virtualization, Linus Torvalds On 05/29, Oleg Nesterov wrote: > > Mike, sorry, I don't understand your email. > > Just in case, let me remind I know nothing about drivers/vhost/ And... this is slightly off-topic but you didn't answer my previous question and I am just curious. Do you agree that, even if we forget about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because it can race with vhost_work_queue() ? Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-29 17:54 ` Oleg Nesterov
@ 2023-05-29 19:03 ` Mike Christie
  0 siblings, 0 replies; 51+ messages in thread
From: Mike Christie @ 2023-05-29 19:03 UTC (permalink / raw)
To: Oleg Nesterov
Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman,
	stefanha, nicolas.dichtel, virtualization, Linus Torvalds

On 5/29/23 12:54 PM, Oleg Nesterov wrote:
> On 05/29, Oleg Nesterov wrote:
>>
>> Mike, sorry, I don't understand your email.
>>
>> Just in case, let me remind I know nothing about drivers/vhost/
>
> And... this is slightly off-topic but you didn't answer my previous
> question and I am just curious. Do you agree that, even if we forget
> about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
> it can race with vhost_work_queue() ?

You saw the reply where I wrote I was waiting for the vhost devs to reply
because it's their code, right? Just wanted to make sure you know I'm not
ignoring your mails. The info is very valuable to me since I don't know
the signal code.

- I think you are right about using a continue after schedule.

- The work fn is stable. It's set up once and never changes.

- I have no idea why we do the __set_current_state(TASK_RUNNING). As far
  as I can tell the work functions do not change the task state and that
  might have been left over from something else. However, the vhost devs
  might know of some case.

- For the barrier use, I think you are right, but I couldn't trigger an
  issue even if I hack up different timings. So I was hoping the vhost
  devs know of something else in there.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-29 17:46 ` Oleg Nesterov 2023-05-29 17:54 ` Oleg Nesterov @ 2023-05-29 19:35 ` Mike Christie 2023-05-29 19:46 ` michael.christie 1 sibling, 1 reply; 51+ messages in thread From: Mike Christie @ 2023-05-29 19:35 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman, stefanha, nicolas.dichtel, virtualization, Linus Torvalds On 5/29/23 12:46 PM, Oleg Nesterov wrote: > Mike, sorry, I don't understand your email. > > Just in case, let me remind I know nothing about drivers/vhost/ > No problem. I get it. I don't know the signal/thread code so it's one of those things where I'm also learning as I go. > On 05/29, michael.christie@oracle.com wrote: >> >> On 5/29/23 6:19 AM, Oleg Nesterov wrote: >>> On 05/27, Eric W. Biederman wrote: >>>> >>>> Looking forward I don't see not asking the worker threads to stop >>>> for the coredump right now causing any problems in the future. >>>> So I think we can use this to resolve the coredump issue I spotted. >>> >>> But we have almost the same problem with exec. >>> >>> Execing thread will wait for vhost_worker() while vhost_worker will wait for >>> .release -> vhost_task_stop(). >> >> For this type of case, what is the goal or correct behavior in the end? >> >> When get_signal returns true we can code things like you mention below > > and you have mentioned in the next email that you have already coded something > like this, so perhaps we can delay the further discussions until you send the > new code? > Ok. Let me post that. You guys and the vhost devs can argue about if it's too ugly to merge :) >> and >> clean up the task_struct. > > Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself, Oh wait, are you saying that when we get auto-reaped then we would do the last fput and call the file_operations->release function right? 
We actually set task_struct->files = NULL for the vhost_task task_struct, so I think we call release a little sooner than you think.

vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struct that gets created works like kthreads where it doesn't do a CLONE_FILES and it doesn't do a dup_fd.

So when we do de_thread() -> zap_other_threads(), that will kill all the threads in the group right? So when they exit, it will call our release function since we don't hold a refcount on ourselves.

>
>> However, we now have a non-functioning vhost device
>> open and just sitting around taking up memory and it can't do any IO.
>
> can't comment, see above.
>
>> For this type of case, do we expect just not to crash/hang, or was this new
>> exec'd thread suppose to be able to use the vhost device?
>
> I just tried to point out that (unless I missed something) there are more corner
> cases, not just coredump.

Ok. I see.

>
>>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>>
>> You mean the vhost_task's task/thread doing a function that does a copy_process
>> right?
>
> I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
> userspace.

I think we were talking about different things. I was saying that when the vhost layer does vhost_task_create() -> copy_process(), the kernel_clone_args.fn is vhost_task_fn() -> vhost_worker(). I thought you were concerned about vhost_worker() or some function it calls, calling copy_process() with CLONE_FILES.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-29 19:35 ` Mike Christie
@ 2023-05-29 19:46 ` michael.christie
  2023-05-30  2:48 ` Eric W. Biederman
  0 siblings, 1 reply; 51+ messages in thread
From: michael.christie @ 2023-05-29 19:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman,
	stefanha, nicolas.dichtel, virtualization, Linus Torvalds

On 5/29/23 2:35 PM, Mike Christie wrote:
>> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself,
> Oh wait, are you saying that when we get auto-reaped then we would do the last
> fput and call the file_operations->release function right? We actually set
> task_struct->files = NULL for the vhost_task task_struct, so I think we call
> release a little sooner than you think.
>
> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struct
> that gets created works like kthreads where it doesn't do a CLONE_FILES and it
> doesn't do a dup_fd.
>
> So when we do de_thread() -> zap_other_threads(), that will kill all the threads
> in the group right? So when they exit, it will call our release function since
> we don't have refcount on ourself.
>

Just to make sure I'm on the same page now.

In the past thread, when we were discussing the patch below, you guys were saying that it doesn't really ignore SIGKILL because we will hit the SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit; that was on purpose.

Instead of a signal to tell me when to exit, I was using the parent exiting and doing the last fput on the vhost device's file, which calls our release function. That then allowed the vhost code to use the vhost_task to handle the outstanding IOs and then do vhost_task_should_stop to tell the vhost_task to exit when the outstanding IO had completed.
On the vhost side of things it's really nice, because all the shutdown paths work the same and we don't need rcu/locking in the main IO path to handle the vhost_task exiting while we are using it. diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..e0f5ac90a228 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,6 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..fd2970b598b2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process( p->flags &= ~PF_KTHREAD; if (args->kthread) p->flags |= PF_KTHREAD; - if (args->user_worker) + if (args->user_worker) { + /* + * User worker are similar to io_threads but they do not + * support signals and cleanup is driven via another kernel + * interface so even SIGKILL is blocked. 
+ */ p->flags |= PF_USER_WORKER; + siginitsetinv(&p->blocked, 0); + } if (args->io_thread) { /* * Mark us an IO worker, and block any signal that isn't @@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); + if (args->user_worker) + p->flags |= PF_NOFREEZE; stackleak_task_init(p); @@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) .fn = fn, .fn_arg = arg, .io_thread = 1, - .user_worker = 1, }; return copy_process(NULL, 0, node, &args); diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..bc7e26072437 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p) return task_curr(p) || !task_sigpending(p); } +static void try_set_pending_sigkill(struct task_struct *t) +{ + /* + * User workers don't support signals and their exit is driven through + * their kernel layer, so by default block even SIGKILL. 
+ */ + if (sigismember(&t->blocked, SIGKILL)) + return; + + sigaddset(&t->pending.signal, SIGKILL); + signal_wake_up(t, 1); +} + static void complete_signal(int sig, struct task_struct *p, enum pid_type type) { struct signal_struct *signal = p->signal; @@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) t = p; do { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - sigaddset(&t->pending.signal, SIGKILL); - signal_wake_up(t, 1); + try_set_pending_sigkill(t); } while_each_thread(p, t); return; } @@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p) /* Don't bother with already dead threads */ if (t->exit_state) continue; - sigaddset(&t->pending.signal, SIGKILL); - signal_wake_up(t, 1); + try_set_pending_sigkill(t); } return count; diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..2d8d3ebaec4d 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | + CLONE_THREAD | CLONE_SIGHAND, .exit_signal = 0, .fn = vhost_task_fn, .name = name, .user_worker = 1, .no_files = 1, - .ignore_signals = 1, }; struct vhost_task *vtsk; struct task_struct *tsk; diff --git a/mm/vmscan.c b/mm/vmscan.c index d257916f39e5..255a2147e5c1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) DEFINE_WAIT(wait); /* - * Do not throttle user workers, kthreads other than kswapd or + * Do not throttle IO/user workers, kthreads other than kswapd or * workqueues. They may be required for reclaim to make * forward progress (e.g. journalling workqueues or kthreads). 
*/ if (!current_is_kswapd() && - current->flags & (PF_USER_WORKER|PF_KTHREAD)) { + current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) { cond_resched(); return; } _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-29 19:46 ` michael.christie @ 2023-05-30 2:48 ` Eric W. Biederman 0 siblings, 0 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-30 2:48 UTC (permalink / raw) To: michael.christie Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization, Linus Torvalds michael.christie@oracle.com writes: > On 5/29/23 2:35 PM, Mike Christie wrote: >>> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself, >> Oh wait, are you saying that when we get auto-reaped then we would do the last >> fput and call the file_operations->release function right? We actually set >> task_struct->files = NULL for the vhost_task task_struct, so I think we call >> release a little sooner than you think. >> >> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struc >> that gets created works like kthreads where it doesn't do a CLONE_FILES and it >> doesn't do a dup_fd. >> >> So when we do de_thread() -> zap_other_threads(), that will kill all the threads >> in the group right? So when they exit, it will call our release function since >> we don't have refcount on ourself. >> > > Just to make sure I'm on the same page now. > > In the past thread when were discussing the patch below and you guys were saying > that it doesn't really ignore SIGKILL because we will hit the > SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit, it was > on purpose. > > Instead of a signal to tell me when do exit, I was using the parent exiting and doing > the last fput on the vhost device's file which calls our release function. That then > allowed the vhost code to use the vhost_task to handle the outstanding IOs and then > do vhost_task_should_stop to tell the vhost_task to exit when the oustanding IO > had completed. 
> > On the vhost side of things it's really nice, because all the shutdown paths work > the same and we don't need rcu/locking in the main IO path to handle the vhost_task > exiting while we are using it. The code below does nothing for exec. You really need to call get_signal to handle SIGSTOP/freeze/process exit. A variant on my coredump proposal looks like it can handle exec as well. It isn't pretty but it looks good enough for right now. If you could test the patch I posted up thread I think that is something imperfect that is good enough for now. Eric _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-29 16:09 ` michael.christie
  2023-05-29 17:46 ` Oleg Nesterov
@ 2023-05-30  2:38 ` Eric W. Biederman
  2023-05-30 15:34 ` Mike Christie
  2023-05-31  3:30 ` Mike Christie
  1 sibling, 2 replies; 51+ messages in thread
From: Eric W. Biederman @ 2023-05-30  2:38 UTC (permalink / raw)
  To: michael.christie
  Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux,
	nicolas.dichtel, virtualization, Linus Torvalds

michael.christie@oracle.com writes:

> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
>> On 05/27, Eric W. Biederman wrote:
>>>
>>> Looking forward I don't see not asking the worker threads to stop
>>> for the coredump right now causing any problems in the future.
>>> So I think we can use this to resolve the coredump issue I spotted.
>>
>> But we have almost the same problem with exec.
>>
>> Execing thread will wait for vhost_worker() while vhost_worker will wait for
>> .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below and
> clean up the task_struct. However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.
>
> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

Looking at the vhost code, from before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"), exec effectively breaks the connection with the vhost device. Aka the vhost device permanently keeps a connection to the mm that opened it. Frankly I think the vhost code will misbehave in fascinating ways if you actually try and use it after exec.

> I would normally say it probably wants to use the vhost device still. However,
> I don't think this comes up so just not hanging might be ok.
Yes, we just need to not hang, and fail as gracefully as possible after exec.

>> And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().
>>
>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> You mean the vhost_task's task/thread doing a function that does a copy_process
> right? That type of thing is not needed. I can add a check in vhost_task_create
> for this so new code doesn't try to do it. I don't think it will come up that some
> code vhost is using will call kernel_thread/copy_process directly since those
> calls are so rare and the functions are not exported to modules.

Oleg was referring to the fact that during exec de_thread comes before do_close_on_exec.

>> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
>> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
>> flush the pending works on ->work_list before exit, I dunno. But imo it should
>> not wait for the final fput().

In the long term I agree. Exiting more or less immediately and leaving the work to vhost_dev_flush looks like where the code needs to go.

For right now, just to get us something before the final -rc, I believe the changes below with some testing are good enough. We can handle exec just like the coredump. I have made those tests specific to the vhost case for now so that there is no danger of triggering any other funnies. If someone does not set O_CLOEXEC or performs file descriptor passing we might get a weird thread with an old mm.

I have also added in the needed changes to convert a few additional PF_IO_WORKER tests to PF_USER_WORKER.

Mike, is there any chance you can test the change below?

From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Mon, 29 May 2023 19:13:59 -0500
Subject: [PATCH] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

When switching from kthreads to vhost_tasks two bugs were added:
1.
The vhost worker tasks now show up as processes, so scripts doing ps or ps a would incorrectly detect the vhost task as another process.
2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be a thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND, which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com>, which was a modified version of a patch originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER, including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c, making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal, which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending(), so clear that explicitly so the schedule() sleeps.

For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file.

To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec, the coredump code and de_thread have been modified to ignore vhost threads.
Removing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/x86/include/asm/fpu/sched.h |  2 +-
 arch/x86/kernel/fpu/context.h    |  2 +-
 arch/x86/kernel/fpu/core.c       |  2 +-
 drivers/vhost/vhost.c            | 21 ++-----
 fs/coredump.c                    |  4 +-
 include/linux/sched/task.h       |  1 -
 include/linux/sched/vhost_task.h | 15 +----
 kernel/exit.c                    |  5 +-
 kernel/fork.c                    | 13 ++---
 kernel/signal.c                  |  8 ++-
 kernel/vhost_task.c              | 94 +++++++++++++++++++++-----------
 11 files changed, 91 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h index c2d6cd78ed0c..78fcde7b1f07 100644 --- a/arch/x86/include/asm/fpu/sched.h +++ b/arch/x86/include/asm/fpu/sched.h @@ -39,7 +39,7 @@ extern void fpu_flush_thread(void); static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu) { if (cpu_feature_enabled(X86_FEATURE_FPU) && - !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) { + !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) { save_fpregs_to_fpstate(old_fpu); /* * The save operation preserved register state, so the diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h index 9fcfa5c4dad7..af5cbdd9bd29 100644 --- a/arch/x86/kernel/fpu/context.h +++ b/arch/x86/kernel/fpu/context.h @@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void) struct fpu *fpu = &current->thread.fpu; int cpu = smp_processor_id(); - if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER))) + if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER))) return; if (!fpregs_state_valid(fpu, cpu)) { diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..1015af1ae562 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask) this_cpu_write(in_kernel_fpu, true); - if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) && + if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) && !test_thread_flag(TIF_NEED_FPU_LOAD)) { set_thread_flag(TIF_NEED_FPU_LOAD); save_fpregs_to_fpstate(¤t->thread.fpu); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..221d1b6c1be5 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * test_and_set_bit() implies a memory barrier. */ llist_add(&work->node, &dev->worker->work_list); - wake_up_process(dev->worker->vtsk->task); + vhost_task_wake(dev->worker->vtsk); } } EXPORT_SYMBOL_GPL(vhost_work_queue); @@ -333,25 +333,14 @@ static void vhost_vq_reset(struct vhost_dev *dev, __vhost_vq_meta_reset(vq); } -static int vhost_worker(void *data) +static bool vhost_worker(void *data) { struct vhost_worker *worker = data; struct vhost_work *work, *work_next; struct llist_node *node; - for (;;) { - /* mb paired w/ kthread_stop */ - set_current_state(TASK_INTERRUPTIBLE); - - if (vhost_task_should_stop(worker->vtsk)) { - __set_current_state(TASK_RUNNING); - break; - } - - node = llist_del_all(&worker->work_list); - if (!node) - schedule(); - + node = llist_del_all(&worker->work_list); + if (node) { node = llist_reverse_order(node); /* make sure flag is seen after deletion */ smp_wmb(); @@ -365,7 +354,7 @@ static int vhost_worker(void *data) } } - return 0; + return !!node; } static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq) diff --git a/fs/coredump.c b/fs/coredump.c index ece7badf701b..88740c51b942 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -371,7 +371,9 @@ static int zap_process(struct task_struct *start, int exit_code) if (t != current && !(t->flags & 
PF_POSTCOREDUMP)) { sigaddset(&t->pending.signal, SIGKILL); signal_wake_up(t, 1); - nr++; + /* The vhost_worker does not particpate in coredumps */ + if ((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER) + nr++; } } diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..e0f5ac90a228 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,6 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h index 6123c10b99cf..837a23624a66 100644 --- a/include/linux/sched/vhost_task.h +++ b/include/linux/sched/vhost_task.h @@ -2,22 +2,13 @@ #ifndef _LINUX_VHOST_TASK_H #define _LINUX_VHOST_TASK_H -#include <linux/completion.h> -struct task_struct; +struct vhost_task; -struct vhost_task { - int (*fn)(void *data); - void *data; - struct completion exited; - unsigned long flags; - struct task_struct *task; -}; - -struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, +struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg, const char *name); void vhost_task_start(struct vhost_task *vtsk); void vhost_task_stop(struct vhost_task *vtsk); -bool vhost_task_should_stop(struct vhost_task *vtsk); +void vhost_task_wake(struct vhost_task *vtsk); #endif diff --git a/kernel/exit.c b/kernel/exit.c index 34b90e2e7cf7..edb50b4c9972 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -411,7 +411,10 @@ static void coredump_task_exit(struct task_struct *tsk) tsk->flags |= PF_POSTCOREDUMP; core_state = tsk->signal->core_state; spin_unlock_irq(&tsk->sighand->siglock); - if (core_state) { + + /* The vhost_worker does not particpate in coredumps */ + if (core_state && + ((tsk->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)) { struct core_thread self; self.task = current; diff --git a/kernel/fork.c 
b/kernel/fork.c index ed4e01daccaa..81cba91f30bb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2336,16 +2336,16 @@ __latent_entropy struct task_struct *copy_process( p->flags &= ~PF_KTHREAD; if (args->kthread) p->flags |= PF_KTHREAD; - if (args->user_worker) - p->flags |= PF_USER_WORKER; - if (args->io_thread) { + if (args->user_worker) { /* - * Mark us an IO worker, and block any signal that isn't + * Mark us a user worker, and block any signal that isn't * fatal or STOP */ - p->flags |= PF_IO_WORKER; + p->flags |= PF_USER_WORKER; siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); } + if (args->io_thread) + p->flags |= PF_IO_WORKER; if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm)); @@ -2517,9 +2517,6 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); - stackleak_task_init(p); if (pid != &init_struct_pid) { diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..2547fa73bde5 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1368,7 +1368,9 @@ int zap_other_threads(struct task_struct *p) while_each_thread(p, t) { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; + /* Don't require de_thread to wait for the vhost_worker */ + if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER) + count++; /* Don't bother with already dead threads */ if (t->exit_state) @@ -2861,11 +2863,11 @@ bool get_signal(struct ksignal *ksig) } /* - * PF_IO_WORKER threads will catch and exit on fatal signals + * PF_USER_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so * we cannot call do_exit() on their behalf. 
*/ - if (current->flags & PF_IO_WORKER) + if (current->flags & PF_USER_WORKER) goto out; /* diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..f3ce0fa6edd7 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -12,58 +12,90 @@ enum vhost_task_flags { VHOST_TASK_FLAGS_STOP, }; +struct vhost_task { + bool (*fn)(void *data); + void *data; + struct completion exited; + unsigned long flags; + struct task_struct *task; +}; + static int vhost_task_fn(void *data) { struct vhost_task *vtsk = data; - int ret; + bool dead = false; + + for (;;) { + bool did_work; + + /* mb paired w/ kthread_stop */ + set_current_state(TASK_INTERRUPTIBLE); + + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) { + __set_current_state(TASK_RUNNING); + break; + } + + if (!dead && signal_pending(current)) { + struct ksignal ksig; + /* + * Calling get_signal will block in SIGSTOP, + * or clear fatal_signal_pending, but remember + * what was set. + * + * This thread won't actually exit until all + * of the file descriptors are closed, and + * the release function is called. + */ + dead = get_signal(&ksig); + if (dead) + clear_thread_flag(TIF_SIGPENDING); + } + + did_work = vtsk->fn(vtsk->data); + if (!did_work) + schedule(); + } - ret = vtsk->fn(vtsk->data); complete(&vtsk->exited); - do_exit(ret); + do_exit(0); } +/** + * vhost_task_wake - wakeup the vhost_task + * @vtsk: vhost_task to wake + * + * wake up the vhost_task worker thread + */ +void vhost_task_wake(struct vhost_task *vtsk) +{ + wake_up_process(vtsk->task); +} +EXPORT_SYMBOL_GPL(vhost_task_wake); + /** * vhost_task_stop - stop a vhost_task * @vtsk: vhost_task to stop * - * Callers must call vhost_task_should_stop and return from their worker - * function when it returns true; + * vhost_task_fn ensures the worker thread exits after + * VHOST_TASK_FLAGS_SOP becomes true. 
*/ void vhost_task_stop(struct vhost_task *vtsk) { - pid_t pid = vtsk->task->pid; - set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags); - wake_up_process(vtsk->task); + vhost_task_wake(vtsk); /* * Make sure vhost_task_fn is no longer accessing the vhost_task before - * freeing it below. If userspace crashed or exited without closing, - * then the vhost_task->task could already be marked dead so - * kernel_wait will return early. + * freeing it below. */ wait_for_completion(&vtsk->exited); - /* - * If we are just closing/removing a device and the parent process is - * not exiting then reap the task. - */ - kernel_wait4(pid, NULL, __WCLONE, NULL); kfree(vtsk); } EXPORT_SYMBOL_GPL(vhost_task_stop); /** - * vhost_task_should_stop - should the vhost task return from the work function - * @vtsk: vhost_task to stop - */ -bool vhost_task_should_stop(struct vhost_task *vtsk) -{ - return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags); -} -EXPORT_SYMBOL_GPL(vhost_task_should_stop); - -/** - * vhost_task_create - create a copy of a process to be used by the kernel - * @fn: thread stack + * vhost_task_create - create a copy of a task to be used by the kernel + * @fn: vhost worker function * @arg: data to be passed to fn * @name: the thread's name * @@ -71,17 +103,17 @@ EXPORT_SYMBOL_GPL(vhost_task_should_stop); * failure. The returned task is inactive, and the caller must fire it up * through vhost_task_start(). 
*/ -struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, +struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | + CLONE_THREAD | CLONE_SIGHAND, .exit_signal = 0, .fn = vhost_task_fn, .name = name, .user_worker = 1, .no_files = 1, - .ignore_signals = 1, }; struct vhost_task *vtsk; struct task_struct *tsk; -- 2.35.3 _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-30 2:38 ` Eric W. Biederman @ 2023-05-30 15:34 ` Mike Christie 2023-05-31 3:30 ` Mike Christie 1 sibling, 0 replies; 51+ messages in thread From: Mike Christie @ 2023-05-30 15:34 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization, Linus Torvalds On 5/29/23 9:38 PM, Eric W. Biederman wrote: > Mike is there any chance you can test the change below? Yeah, I'm firing up tests now. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-30 2:38 ` Eric W. Biederman 2023-05-30 15:34 ` Mike Christie @ 2023-05-31 3:30 ` Mike Christie 1 sibling, 0 replies; 51+ messages in thread From: Mike Christie @ 2023-05-31 3:30 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization, Linus Torvalds On 5/29/23 9:38 PM, Eric W. Biederman wrote: > diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c > index b7cbd66f889e..f3ce0fa6edd7 100644 > --- a/kernel/vhost_task.c > +++ b/kernel/vhost_task.c ... > static int vhost_task_fn(void *data) > { > struct vhost_task *vtsk = data; > - int ret; > + bool dead = false; > + > + for (;;) { > + bool did_work; > + > + /* mb paired w/ kthread_stop */ > + set_current_state(TASK_INTERRUPTIBLE); > + > + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) { > + __set_current_state(TASK_RUNNING); > + break; > + } > + > + if (!dead && signal_pending(current)) { > + struct ksignal ksig; > + /* > + * Calling get_signal will block in SIGSTOP, > + * or clear fatal_signal_pending, but remember > + * what was set. > + * > + * This thread won't actually exit until all > + * of the file descriptors are closed, and > + * the release function is called. > + */ > + dead = get_signal(&ksig); Hey Eric, the patch works well. Thanks! There was just one small issue. get_signal() does try_to_freeze() -> ... __might_sleep() which wants the state to be TASK_RUNNING. We just need the patch below on top of yours which I think also cleans up some of the state setting weirdness with the code like where vhost.c calls __set_current_state(TASK_RUNNING) for each work. 
It looks like that was not needed for any reason like a work->fn changing the state and the old code could have done: node = llist_del_all(&worker->work_list); if (!node) { schedule(); continue; } else { __set_current_state(TASK_RUNNING); } So I think we can do the following on top of your patch: diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 221d1b6c1be5..074273020849 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -346,7 +346,6 @@ static bool vhost_worker(void *data) smp_wmb(); llist_for_each_entry_safe(work, work_next, node, node) { clear_bit(VHOST_WORK_QUEUED, &work->flags); - __set_current_state(TASK_RUNNING); kcov_remote_start_common(worker->kcov_handle); work->fn(work); kcov_remote_stop(); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index f3ce0fa6edd7..fead2ed29561 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -29,12 +29,8 @@ static int vhost_task_fn(void *data) bool did_work; /* mb paired w/ kthread_stop */ - set_current_state(TASK_INTERRUPTIBLE); - - if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) { - __set_current_state(TASK_RUNNING); + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) break; - } if (!dead && signal_pending(current)) { struct ksignal ksig; @@ -53,8 +49,10 @@ static int vhost_task_fn(void *data) } did_work = vtsk->fn(vtsk->data); - if (!did_work) + if (!did_work) { + set_current_state(TASK_INTERRUPTIBLE); schedule(); + } } complete(&vtsk->exited); _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-29 11:19 ` Oleg Nesterov 2023-05-29 16:09 ` michael.christie @ 2023-05-29 16:11 ` michael.christie [not found] ` <20230530-autor-faxnummer-01e0a31c0fb8@brauner> 2 siblings, 0 replies; 51+ messages in thread From: michael.christie @ 2023-05-29 16:11 UTC (permalink / raw) To: Oleg Nesterov, Eric W. Biederman Cc: axboe, brauner, mst, linux-kernel, linux, virtualization, stefanha, nicolas.dichtel, Linus Torvalds On 5/29/23 6:19 AM, Oleg Nesterov wrote: > If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal() > returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and > flush the pending works on ->work_list before exit, I dunno. But imo it should > not wait for the final fput(). Just a FYI, I coded this. It's pre-RFC quality. It's invasive. If we want to go this route then we have to do a temp hack for 6.4 or revert then do this route for 6.5 or 6.6 (vhost devs will need time to review and we need time to full test). _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression [not found] ` <20230530-autor-faxnummer-01e0a31c0fb8@brauner> @ 2023-05-30 17:55 ` Oleg Nesterov 0 siblings, 0 replies; 51+ messages in thread From: Oleg Nesterov @ 2023-05-30 17:55 UTC (permalink / raw) To: Christian Brauner Cc: axboe, mst, linux, linux-kernel, Eric W. Biederman, stefanha, nicolas.dichtel, virtualization, Linus Torvalds On 05/30, Christian Brauner wrote: > > On Mon, May 29, 2023 at 01:19:39PM +0200, Oleg Nesterov wrote: > > > > If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal() > > returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and > > Yes, and that's what I proposed at the beginning of this thread. Yes. And you know, I misunderstood you even though I had the same feeling from the very beginning too (except I wasn't, and still am not, sure CLONE_THREAD is a good idea). Because... OK, I think it doesn't matter now ;) Mike, Eric, et al. I'll try to (at least) read your emails tomorrow. Another day spent on redhat bugzillas. Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-27 9:49 ` Eric W. Biederman 2023-05-27 16:12 ` Linus Torvalds @ 2023-05-30 15:01 ` Eric W. Biederman 1 sibling, 0 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-30 15:01 UTC (permalink / raw) To: Linus Torvalds Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization "Eric W. Biederman" <ebiederm@xmission.com> writes: > Linus Torvalds <torvalds@linux-foundation.org> writes: > >> So I'd really like to finish this. Even if we end up with a hack or >> two in signal handling that we can hopefully fix up later by having >> vhost fix up some of its current assumptions. > > > The real sticky widget for me is how to handle one of these processes > coredumping. It really looks like it will result in a reliable hang. > > Limiting ourselves to changes that will only affect vhost, all I can > see would be allowing the vhost_worker thread to exit as soon as > get_signal reports the process is exiting. Then vhost_dev_flush > would need to process the pending work. > Oleg recently pointed out that the trickiest case currently appears to be what happens if someone calls exec, in a process using vhost. do_close_on_exec is called after de_thread, and after the mm has changed. Which means that my idea of moving the work from vhost_worker into vhost_dev_flush can't work. At the point that flush is called it has the wrong mm. Which means the flush or cancel of the pending work needs to happen in the vhost thread, we can't assume there is any other thread available to do the work. What makes this all nice is that the vhost code has vhost_dev_check_owner which ensures only one mm can initiate I/O. Which means file descriptor passing is essentially an academic concern. In the case of both process exit, and exec except for a racing on which piece of code shuts down first there should be no more I/O going into the work queues. 
But it is going to take someone who understands and cares about vhost to figure out how to stop new I/O from going into the work queues and to ensure that on-going work is dealt with. Eric _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
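The vhost_dev_check_owner() behavior Eric leans on can be sketched as follows. This is a simplified userspace model with hypothetical types; the real check in drivers/vhost/vhost.c compares the caller's mm against the mm recorded at VHOST_SET_OWNER time:

```c
#include <stddef.h>

/* simplified stand-in for struct vhost_dev: remembers the mm of the
 * process that issued VHOST_SET_OWNER */
struct fake_dev {
	void *owner_mm;
};

static int set_owner(struct fake_dev *d, void *caller_mm)
{
	if (d->owner_mm)
		return -1;	/* -EPERM: already owned */
	d->owner_mm = caller_mm;
	return 0;
}

/* models vhost_dev_check_owner(): I/O ioctls fail unless the caller's
 * mm matches the owner's, so a passed-around fd cannot be used to
 * initiate I/O from a second process */
static int check_owner(struct fake_dev *d, void *caller_mm)
{
	return d->owner_mm == caller_mm ? 0 : -1;	/* -EPERM */
}
```

This is why fd passing is "essentially an academic concern" for I/O: a second process holding the fd can keep the worker alive, but cannot drive new work through it.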
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-23 15:57 ` Eric W. Biederman 2023-05-24 14:10 ` Oleg Nesterov @ 2023-05-31 5:22 ` Jason Wang 1 sibling, 0 replies; 51+ messages in thread From: Jason Wang @ 2023-05-31 5:22 UTC (permalink / raw) To: Eric W. Biederman, Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, stefanha, nicolas.dichtel, virtualization, torvalds On 2023/5/23 23:57, Eric W. Biederman wrote: > Oleg Nesterov <oleg@redhat.com> writes: > >> On 05/22, Oleg Nesterov wrote: >>> Right now I think that "int dead" should die, >> No, probably we shouldn't call get_signal() if we have already >> dequeued SIGKILL. > Very much agreed. It is one thing to add a patch to move do_exit > out of get_signal. It is another to keep calling get_signal after > that. Nothing tests that case, and so we get some weird behaviors. > > >>> but let me think tomorrow. >> May be something like this... I don't like it but I can't suggest anything better >> right now. >> >> bool killed = false; >> >> for (;;) { >> ... >> >> node = llist_del_all(&worker->work_list); >> if (!node) { >> schedule(); >> /* >> * When we get a SIGKILL our release function will >> * be called. That will stop new IOs from being queued >> * and check for outstanding cmd responses. It will then >> * call vhost_task_stop to tell us to return and exit. >> */ >> if (signal_pending(current)) { >> struct ksignal ksig; >> >> if (!killed) >> killed = get_signal(&ksig); >> >> clear_thread_flag(TIF_SIGPENDING); >> } >> >> continue; >> } > I want to point out that we need to consider not just SIGKILL, but > SIGABRT that causes a coredump, as well as the process performing > an ordinary exit(2). All of which will cause get_signal to return > SIGKILL in this context. > >> ------------------------------------------------------------------------------- >> But let me ask a couple of questions. > I share most of these questions. 
> >> Let's forget this patch, let's look at the >> current code: >> >> node = llist_del_all(&worker->work_list); >> if (!node) >> schedule(); >> >> node = llist_reverse_order(node); >> ... process works ... >> >> To me this looks a bit confusing. Shouldn't we do >> >> if (!node) { >> schedule(); >> continue; >> } >> >> just to make the code a bit more clear? If node == NULL then >> llist_reverse_order() and llist_for_each_entry_safe() will do nothing. >> But this is minor. >> >> >> >> /* make sure flag is seen after deletion */ >> smp_wmb(); >> llist_for_each_entry_safe(work, work_next, node, node) { >> clear_bit(VHOST_WORK_QUEUED, &work->flags); >> >> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, >> vhost_work_queue() can add this work again and change work->node->next. >> >> That is why we use _safe, but we need to ensure that llist_for_each_safe() >> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. >> >> So it seems that smp_wmb() can't help and should be removed, instead we need >> >> llist_for_each_entry_safe(...) { >> smp_mb__before_atomic(); >> clear_bit(VHOST_WORK_QUEUED, &work->flags); >> >> Also, if the work->fn pointer is not stable, we should read it before >> smp_mb__before_atomic() as well. >> >> No? >> >> >> __set_current_state(TASK_RUNNING); >> >> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() >> can return with current->state != RUNNING ? >> >> >> work->fn(work); >> >> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right >> before we call work->fn(). Is it "safe" to run this callback with >> signal_pending() or fatal_signal_pending() ? >> >> >> Finally. I never looked into drivers/vhost/ before so I don't understand >> this code at all, but let me ask anyway... Can we change vhost_dev_flush() >> to run the pending callbacks rather than wait for vhost_worker() ? >> I guess we can't, ->mm won't be correct, but can you confirm? 
> In a conversation long ago I remember hearing that vhost does not > support file descriptor passing. Which means all of the file > descriptors should be in the same process. It's not. Passing the vhost fd is actually pretty common, since Qemu usually runs without privileges, so it's the responsibility of the management layer to open the vhost fd and pass it to Qemu. > > Looking at the vhost code what I am seeing happening is that the > vhost_worker persists until vhost_dev_cleanup is called from > one of the vhost_???_release() functions. The release functions > are only called after the last flush function completes. See __fput > if you want to trace the details. > > > On one hand this all seems reasonable. On the other hand I am not > seeing the code that prevents file descriptor passing. Yes. > > > It is probably not the worst thing in the world, but what this means > is now if you pass a copy of the vhost file descriptor to another > process the vhost_worker will persist, and thus the process will persist > until that copy of the file descriptor is closed. Right. Thanks > > Eric > _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
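For reference, the descriptor passing Eric and Jason are discussing works via SCM_RIGHTS ancillary data over a unix socket. A minimal userspace sketch of the mechanism — nothing vhost-specific, and for simplicity the fd is passed over a socketpair within one process:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send one byte plus an fd as SCM_RIGHTS ancillary data */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *c = CMSG_FIRSTHDR(&msg);

	c->cmsg_level = SOL_SOCKET;
	c->cmsg_type = SCM_RIGHTS;
	c->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(c), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* receive the byte and return the new fd installed by the kernel */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *c;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	c = CMSG_FIRSTHDR(&msg);
	if (!c || c->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(c), sizeof(int));
	return fd;
}
```

The receiver gets its own reference to the same struct file, which is exactly why a vhost fd passed this way keeps the worker alive until the last copy is closed.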
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-23 12:15 ` Oleg Nesterov 2023-05-23 15:57 ` Eric W. Biederman @ 2023-05-24 0:02 ` Mike Christie 2023-05-25 16:15 ` Mike Christie 2023-05-31 5:22 ` Jason Wang 3 siblings, 0 replies; 51+ messages in thread From: Mike Christie @ 2023-05-24 0:02 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds Hey Oleg, For all these questions below let me get back to you by tomorrow. I need to confirm if something would be considered a regression by the core vhost developers. On 5/23/23 7:15 AM, Oleg Nesterov wrote: > On 05/22, Oleg Nesterov wrote: >> >> Right now I think that "int dead" should die, > > No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL. > >> but let me think tomorrow. > > May be something like this... I don't like it but I can't suggest anything better > right now. > > bool killed = false; > > for (;;) { > ... > > node = llist_del_all(&worker->work_list); > if (!node) { > schedule(); > /* > * When we get a SIGKILL our release function will > * be called. That will stop new IOs from being queued > * and check for outstanding cmd responses. It will then > * call vhost_task_stop to tell us to return and exit. > */ > if (signal_pending(current)) { > struct ksignal ksig; > > if (!killed) > killed = get_signal(&ksig); > > clear_thread_flag(TIF_SIGPENDING); > } > > continue; > } > > ------------------------------------------------------------------------------- > But let me ask a couple of questions. Let's forget this patch, let's look at the > current code: > > node = llist_del_all(&worker->work_list); > if (!node) > schedule(); > > node = llist_reverse_order(node); > ... process works ... > > To me this looks a bit confusing. Shouldn't we do > > if (!node) { > schedule(); > continue; > } > > just to make the code a bit more clear? 
If node == NULL then > llist_reverse_order() and llist_for_each_entry_safe() will do nothing. > But this is minor. > > > > /* make sure flag is seen after deletion */ > smp_wmb(); > llist_for_each_entry_safe(work, work_next, node, node) { > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > vhost_work_queue() can add this work again and change work->node->next. > > That is why we use _safe, but we need to ensure that llist_for_each_safe() > completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > So it seems that smp_wmb() can't help and should be removed, instead we need > > llist_for_each_entry_safe(...) { > smp_mb__before_atomic(); > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > Also, if the work->fn pointer is not stable, we should read it before > smp_mb__before_atomic() as well. > > No? > > > __set_current_state(TASK_RUNNING); > > Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() > can return with current->state != RUNNING ? > > > work->fn(work); > > Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > before we call work->fn(). Is it "safe" to run this callback with > signal_pending() or fatal_signal_pending() ? > > > Finally. I never looked into drivers/vhost/ before so I don't understand > this code at all, but let me ask anyway... Can we change vhost_dev_flush() > to run the pending callbacks rather than wait for vhost_worker() ? > I guess we can't, ->mm won't be correct, but can you confirm? > > Oleg. > _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
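The reason the loop uses the _safe iterator, which Oleg's question turns on, can be shown with an ordinary singly linked list: the iterator snapshots ->next before the body runs, so the body may requeue the node (as vhost_work_queue() can once VHOST_WORK_QUEUED is cleared) without derailing the walk. A hypothetical userspace sketch, not the kernel's llist.h:

```c
#include <stddef.h>

struct node { struct node *next; int visited; };

/* models llist_for_each_entry_safe(): n is loaded from pos->next
 * before the loop body executes */
#define for_each_safe(pos, n, head) \
	for ((pos) = (head); (pos) && ((n) = (pos)->next, 1); (pos) = (n))

/* walk the list while the "work fn" requeues each node onto another
 * list, clobbering node->next -- the snapshot keeps the walk intact */
static int walk_and_requeue(struct node *head, struct node **requeue)
{
	struct node *pos, *n;
	int count = 0;

	for_each_safe(pos, n, head) {
		/* like clear_bit(VHOST_WORK_QUEUED) followed by a
		 * concurrent vhost_work_queue() re-adding the work */
		pos->next = *requeue;
		*requeue = pos;
		pos->visited = 1;
		count++;
	}
	return count;
}
```

Oleg's SMP concern is precisely that this single-threaded guarantee is not enough: the snapshot load must also be ordered before the clear_bit() becomes visible to other CPUs.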
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-23 12:15 ` Oleg Nesterov 2023-05-23 15:57 ` Eric W. Biederman 2023-05-24 0:02 ` Mike Christie @ 2023-05-25 16:15 ` Mike Christie 2023-05-28 1:41 ` Eric W. Biederman 2023-05-31 5:22 ` Jason Wang 3 siblings, 1 reply; 51+ messages in thread From: Mike Christie @ 2023-05-25 16:15 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 5/23/23 7:15 AM, Oleg Nesterov wrote: > > Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > before we call work->fn(). Is it "safe" to run this callback with > signal_pending() or fatal_signal_pending() ? The questions before this one I'll leave for the core vhost devs since they know best. For this question and the one below, when we get SIGKILL we think it's ok to fail the operation as in it's ok to not execute it like normal (send it down to the net/target/scsi/block layers for execution). However, we need to do some processing of the work to release mem, refcounts, etc. For example we use the vhost_task to handle completions from the layers we interact with. So when we get a SIGKILL, we could have N commands in the target/block/scsi/nvme layers below the vhost layer. When those complete, the current code assumes we have the vhost_task to handle the responses. So for pending works on that work_list, we can pretty easily kill them. However, we don't have a way to kill those outstanding requests to some other layer, so we have to wait and handle them somewhere. If you are saying that once we get SIGKILL then we just can't use the task anymore and we have to drop down to workqueue/kthread or change up the completion paths so they can execute in the context they are completed from by lower levels, then I can code this. 
On the vhost side, it's just really ugly vs the code we have now that used to use kthreads (or in 6.4-rc looks like a process) so we couldn't get signals and just had the one code path that removes us. From the vhost point of view signals are not useful to us and it's just adding complexity for something that I don't think is useful to users. However after discussing this with guys for a week and going over the signal code, I can see your point of views where you guys are thinking its ugly for the signal code to try and support what we want :) > > > Finally. I never looked into drivers/vhost/ before so I don't understand > this code at all, but let me ask anyway... Can we change vhost_dev_flush() > to run the pending callbacks rather than wait for vhost_worker() ? > I guess we can't, ->mm won't be correct, but can you confirm? > > Oleg. > _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-25 16:15 ` Mike Christie @ 2023-05-28 1:41 ` Eric W. Biederman 2023-05-28 19:29 ` Mike Christie 0 siblings, 1 reply; 51+ messages in thread From: Eric W. Biederman @ 2023-05-28 1:41 UTC (permalink / raw) To: Mike Christie Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization, torvalds Mike Christie <michael.christie@oracle.com> writes: > On 5/23/23 7:15 AM, Oleg Nesterov wrote: >> >> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right >> before we call work->fn(). Is it "safe" to run this callback with >> signal_pending() or fatal_signal_pending() ? > > The questions before this one I'll leave for the core vhost devs since > they know best. Let me ask a clarifying question: Is it only the call to schedule() in vhost_worker that you are worried about not sleeping if signal_pending() or fatal_signal_pending()? Is there concern that the worker functions aka "work->fn()" will also have killable or interruptible sleeps that also will misbehave. We can handle schedule() in vhost_worker without problem. If a worker function has interruptible or killable sleeps that will turn into busy waits or worse not sleeping long enough that seems like a problem. There is no way to guarantee that the outer loop of vhost_worker will protect the worker functions from signal_pending() or fatal_signal_pending() becoming true. Eric _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
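The busy-wait Eric is probing at comes from how __schedule() treats an interruptible sleep: while signal_pending() stays true, the task is put straight back to TASK_RUNNING instead of blocking, so the outer loop spins. A toy model with hypothetical names, showing why the pending flag must be consumed (via get_signal() and/or clear_thread_flag(TIF_SIGPENDING)) before sleeping again:

```c
#include <stdbool.h>

static bool sigpending;	/* models TIF_SIGPENDING */
static int busy_spins;

/* models __schedule(): an INTERRUPTIBLE sleep with a pending signal
 * does not block -- the task goes straight back to TASK_RUNNING */
static void fake_schedule(void)
{
	if (sigpending)
		busy_spins++;	/* returned without sleeping */
	/* else: would block here until woken */
}

/* worker loop with no work queued; returns how many times the loop
 * spun instead of sleeping */
static int idle_loop(int iterations, bool handle_signal)
{
	sigpending = true;
	busy_spins = 0;
	for (int i = 0; i < iterations; i++) {
		if (sigpending && handle_signal)
			sigpending = false;	/* get_signal() / clear flag */
		fake_schedule();
	}
	return busy_spins;
}
```

This is only about schedule() in the outer loop; as Eric notes, interruptible sleeps inside the work functions themselves cannot be protected this way.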
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-28 1:41 ` Eric W. Biederman @ 2023-05-28 19:29 ` Mike Christie 0 siblings, 0 replies; 51+ messages in thread From: Mike Christie @ 2023-05-28 19:29 UTC (permalink / raw) To: Eric W. Biederman Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux, nicolas.dichtel, virtualization, torvalds On 5/27/23 8:41 PM, Eric W. Biederman wrote: > Mike Christie <michael.christie@oracle.com> writes: > >> On 5/23/23 7:15 AM, Oleg Nesterov wrote: >>> >>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right >>> before we call work->fn(). Is it "safe" to run this callback with >>> signal_pending() or fatal_signal_pending() ? >> >> The questions before this one I'll leave for the core vhost devs since >> they know best. > > Let me ask a clarifying question: > > Is it only the call to schedule() in vhost_worker that you are worried > about not sleeping if signal_pending() or fatal_signal_pending()? It will only be the vhost_worker call to schedule(). When we do the file_operation's release call, we normally set things up so the work->fn just fails and cleans up. I can pretty easily move that code into a helper and do: if (get_signal(ksig)) new_function_to_tell_drivers_that_all_work_fns_should_fail() _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-23 12:15 ` Oleg Nesterov ` (2 preceding siblings ...) 2023-05-25 16:15 ` Mike Christie @ 2023-05-31 5:22 ` Jason Wang 2023-05-31 7:25 ` Oleg Nesterov 3 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2023-05-31 5:22 UTC (permalink / raw) To: Oleg Nesterov, Mike Christie Cc: axboe, brauner, mst, linux-kernel, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 2023/5/23 20:15, Oleg Nesterov wrote: > On 05/22, Oleg Nesterov wrote: >> Right now I think that "int dead" should die, > No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL. > >> but let me think tomorrow. > May be something like this... I don't like it but I can't suggest anything better > right now. > > bool killed = false; > > for (;;) { > ... > > node = llist_del_all(&worker->work_list); > if (!node) { > schedule(); > /* > * When we get a SIGKILL our release function will > * be called. That will stop new IOs from being queued > * and check for outstanding cmd responses. It will then > * call vhost_task_stop to tell us to return and exit. > */ > if (signal_pending(current)) { > struct ksignal ksig; > > if (!killed) > killed = get_signal(&ksig); > > clear_thread_flag(TIF_SIGPENDING); > } > > continue; > } > > ------------------------------------------------------------------------------- > But let me ask a couple of questions. Let's forget this patch, let's look at the > current code: > > node = llist_del_all(&worker->work_list); > if (!node) > schedule(); > > node = llist_reverse_order(node); > ... process works ... > > To me this looks a bit confusing. Shouldn't we do > > if (!node) { > schedule(); > continue; > } > > just to make the code a bit more clear? If node == NULL then > llist_reverse_order() and llist_for_each_entry_safe() will do nothing. > But this is minor. Yes. 
> > > > /* make sure flag is seen after deletion */ > smp_wmb(); > llist_for_each_entry_safe(work, work_next, node, node) { > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > vhost_work_queue() can add this work again and change work->node->next. > > That is why we use _safe, but we need to ensure that llist_for_each_safe() > completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. This should be fine since store is not speculated, so work->node->next needs to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition. > So it seems that smp_wmb() can't help and should be removed, instead we need > > llist_for_each_entry_safe(...) { > smp_mb__before_atomic(); > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > Also, if the work->fn pointer is not stable, we should read it before > smp_mb__before_atomic() as well. The fn won't be changed after it is initialized. > > No? > > > __set_current_state(TASK_RUNNING); > > Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() > can return with current->state != RUNNING ? It is because the state were set to TASK_INTERRUPTIBLE in the beginning of the loop otherwise it might be side effect while executing work->fn(). > > > work->fn(work); > > Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > before we call work->fn(). Is it "safe" to run this callback with > signal_pending() or fatal_signal_pending() ? It looks safe since: 1) vhost hold refcnt of the mm 2) release will sync with the worker > > > Finally. I never looked into drivers/vhost/ before so I don't understand > this code at all, but let me ask anyway... Can we change vhost_dev_flush() > to run the pending callbacks rather than wait for vhost_worker() ? > I guess we can't, ->mm won't be correct, but can you confirm? Yes. Thanks > > Oleg. 
> _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-31 5:22 ` Jason Wang @ 2023-05-31 7:25 ` Oleg Nesterov 2023-05-31 8:17 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-31 7:25 UTC (permalink / raw) To: Jason Wang Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 05/31, Jason Wang wrote: > > 在 2023/5/23 20:15, Oleg Nesterov 写道: > > > > /* make sure flag is seen after deletion */ > > smp_wmb(); > > llist_for_each_entry_safe(work, work_next, node, node) { > > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > >vhost_work_queue() can add this work again and change work->node->next. > > > >That is why we use _safe, but we need to ensure that llist_for_each_safe() > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > This should be fine since store is not speculated, so work->node->next needs > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition. I don't understand you. OK, to simplify, suppose we have 2 global vars void *PTR = something_non_null; unsigned long FLAGS = -1ul; Now I think this code CPU_0 CPU_1 void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) clear_bit(0, FLAGS); PTR = NULL; BUG_ON(!ptr); is racy and can hit the BUG_ON(!ptr). I guess it is fine on x86, but in general you need smp_mb__before_atomic() before clear_bit(), or clear_bit_unlock(). > > __set_current_state(TASK_RUNNING); > > > >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() > >can return with current->state != RUNNING ? > > It is because the state were set to TASK_INTERRUPTIBLE in the beginning of > the loop otherwise it might be side effect while executing work->fn(). Again, I don't understand you. So let me repeat: can work->fn() return with current->_state != TASK_RUNNING ? 
If not (and I'd say it should not), you can do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe(). > >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > >before we call work->fn(). Is it "safe" to run this callback with > >signal_pending() or fatal_signal_pending() ? > > It looks safe since: > > 1) vhost hold refcnt of the mm > 2) release will sync with the worker Well, that's not what I asked... nevermind, please forget. Thanks. Oleg. _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-31 7:25 ` Oleg Nesterov @ 2023-05-31 8:17 ` Jason Wang 2023-05-31 9:14 ` Oleg Nesterov 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2023-05-31 8:17 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 05/31, Jason Wang wrote: > > > > 在 2023/5/23 20:15, Oleg Nesterov 写道: > > > > > > /* make sure flag is seen after deletion */ > > > smp_wmb(); > > > llist_for_each_entry_safe(work, work_next, node, node) { > > > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > > > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > > >vhost_work_queue() can add this work again and change work->node->next. > > > > > >That is why we use _safe, but we need to ensure that llist_for_each_safe() > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > > > This should be fine since store is not speculated, so work->node->next needs > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition. > > I don't understand you. OK, to simplify, suppose we have 2 global vars > > void *PTR = something_non_null; > unsigned long FLAGS = -1ul; > > Now I think this code > > CPU_0 CPU_1 > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) > clear_bit(0, FLAGS); PTR = NULL; > BUG_ON(!ptr); > > is racy and can hit the BUG_ON(!ptr). This seems different to the above case? 
And you can hit BUG_ON with the following execution sequence: [cpu 0] clear_bit(0, FLAGS); [cpu 1] if (!test_and_set_bit(0, FLAGS)) [cpu 1] PTR = NULL; [cpu 0] BUG_ON(!ptr) In vhost code, there's a condition before the clear_bit() which sits inside llist_for_each_entry_safe(): #define llist_for_each_entry_safe(pos, n, node, member) \ for (pos = llist_entry((node), typeof(*pos), member); \ member_address_is_nonnull(pos, member) && \ (n = llist_entry(pos->member.next, typeof(*n), member), true); \ pos = n) The clear_bit() is a store which is not speculated, so there's a control dependency, the store can't be executed until the condition expression is evaluated which requires pos->member.next (work->node.next) to be loaded. > > I guess it is fine on x86, but in general you need smp_mb__before_atomic() > before clear_bit(), or clear_bit_unlock(). > > > > __set_current_state(TASK_RUNNING); > > > > > >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() > > >can return with current->state != RUNNING ? > > > > It is because the state were set to TASK_INTERRUPTIBLE in the beginning of > > the loop otherwise it might be side effect while executing work->fn(). > > Again, I don't understand you. So let me repeat: can work->fn() return with > current->_state != TASK_RUNNING ? If not (and I'd say it should not), you can > do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe(). > Ok, that should be fine. Thanks > > >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > > >before we call work->fn(). Is it "safe" to run this callback with > > >signal_pending() or fatal_signal_pending() ? > > > > It looks safe since: > > > > 1) vhost hold refcnt of the mm > > 2) release will sync with the worker > > Well, that's not what I asked... nevermind, please forget. > > Thanks. > > Oleg. 
> _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-31 8:17 ` Jason Wang @ 2023-05-31 9:14 ` Oleg Nesterov 2023-06-01 2:44 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-05-31 9:14 UTC (permalink / raw) To: Jason Wang Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 05/31, Jason Wang wrote: > > On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > On 05/31, Jason Wang wrote: > > > > > > 在 2023/5/23 20:15, Oleg Nesterov 写道: > > > > > > > > /* make sure flag is seen after deletion */ > > > > smp_wmb(); > > > > llist_for_each_entry_safe(work, work_next, node, node) { > > > > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > > > > > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > > > >vhost_work_queue() can add this work again and change work->node->next. > > > > > > > >That is why we use _safe, but we need to ensure that llist_for_each_safe() > > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > > > > > This should be fine since store is not speculated, so work->node->next needs > > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition. > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars > > > > void *PTR = something_non_null; > > unsigned long FLAGS = -1ul; > > > > Now I think this code > > > > CPU_0 CPU_1 > > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) > > clear_bit(0, FLAGS); PTR = NULL; > > BUG_ON(!ptr); > > > > is racy and can hit the BUG_ON(!ptr). > > This seems different to the above case? not sure, > And you can hit BUG_ON with > the following execution sequence: > > [cpu 0] clear_bit(0, FLAGS); > [cpu 1] if (!test_and_set_bit(0, FLAGS)) > [cpu 1] PTR = NULL; > [cpu 0] BUG_ON(!ptr) I don't understand this part... 
yes, we can hit this BUG_ON() without mb in between, this is what I tried to say.

> In vhost code, there's a condition before the clear_bit() which sits
> inside llist_for_each_entry_safe():
>
> #define llist_for_each_entry_safe(pos, n, node, member)                \
>         for (pos = llist_entry((node), typeof(*pos), member);          \
>              member_address_is_nonnull(pos, member) &&                 \
>              (n = llist_entry(pos->member.next, typeof(*n), member), true); \
>              pos = n)
>
> The clear_bit() is a store which is not speculated, so there's a
> control dependency, the store can't be executed until the condition
> expression is evaluated which requires pos->member.next
> (work->node.next) to be loaded.

But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
something like

	n = llist_entry(...);
	if (n)
		clear_bit(...);

so I do not see how can we rely on the load-store control dependency.

Oleg.
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-31 9:14 ` Oleg Nesterov @ 2023-06-01 2:44 ` Jason Wang 2023-06-01 7:43 ` Oleg Nesterov 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2023-06-01 2:44 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 05/31, Jason Wang wrote: > > > > On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > On 05/31, Jason Wang wrote: > > > > > > > > 在 2023/5/23 20:15, Oleg Nesterov 写道: > > > > > > > > > > /* make sure flag is seen after deletion */ > > > > > smp_wmb(); > > > > > llist_for_each_entry_safe(work, work_next, node, node) { > > > > > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > > > > > > > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > > > > >vhost_work_queue() can add this work again and change work->node->next. > > > > > > > > > >That is why we use _safe, but we need to ensure that llist_for_each_safe() > > > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > > > > > > > This should be fine since store is not speculated, so work->node->next needs > > > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition. > > > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars > > > > > > void *PTR = something_non_null; > > > unsigned long FLAGS = -1ul; > > > > > > Now I think this code > > > > > > CPU_0 CPU_1 > > > > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) > > > clear_bit(0, FLAGS); PTR = NULL; > > > BUG_ON(!ptr); > > > > > > is racy and can hit the BUG_ON(!ptr). > > > > This seems different to the above case? 
> > not sure, > > > And you can hit BUG_ON with > > the following execution sequence: > > > > [cpu 0] clear_bit(0, FLAGS); > > [cpu 1] if (!test_and_set_bit(0, FLAGS)) > > [cpu 1] PTR = NULL; > > [cpu 0] BUG_ON(!ptr) > > I don't understand this part... yes, we can hit this BUG_ON() without mb in > between, this is what I tried to say. I may miss something, but the above is the sequence that is executed by the processor (for each CPU, it's just the program order). So where do you expect to place an mb can help? > > > In vhost code, there's a condition before the clear_bit() which sits > > inside llist_for_each_entry_safe(): > > > > #define llist_for_each_entry_safe(pos, n, node, member) \ > > for (pos = llist_entry((node), typeof(*pos), member); \ > > member_address_is_nonnull(pos, member) && \ > > (n = llist_entry(pos->member.next, typeof(*n), member), true); \ > > pos = n) > > > > The clear_bit() is a store which is not speculated, so there's a > > control dependency, the store can't be executed until the condition > > expression is evaluated which requires pos->member.next > > (work->node.next) to be loaded. > > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have > something like > > n = llist_entry(...); > if (n) > clear_bit(...); > > so I do not see how can we rely on the load-store control dependency. Just to make sure we are on the same page, the condition expression is member_address_is_nonnull(pos, member) && (n = llist_entry(pos->member.next, typeof(*n), member), true) So it's something like: if (work->node && (work_next = work->node->next, true)) clear_bit(&work->flags); So two loads from both work->node and work->node->next, and there's a store which is clear_bit, then it's a load-store control dependencies? Thanks > > Oleg. 
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-06-01 2:44 ` Jason Wang @ 2023-06-01 7:43 ` Oleg Nesterov 2023-06-02 5:03 ` Jason Wang 0 siblings, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-06-01 7:43 UTC (permalink / raw) To: Jason Wang Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 06/01, Jason Wang wrote: > > On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars > > > > > > > > void *PTR = something_non_null; > > > > unsigned long FLAGS = -1ul; > > > > > > > > Now I think this code > > > > > > > > CPU_0 CPU_1 > > > > > > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) > > > > clear_bit(0, FLAGS); PTR = NULL; > > > > BUG_ON(!ptr); > > > > > > > > is racy and can hit the BUG_ON(!ptr). > > > > > > This seems different to the above case? > > > > not sure, > > > > > And you can hit BUG_ON with > > > the following execution sequence: > > > > > > [cpu 0] clear_bit(0, FLAGS); > > > [cpu 1] if (!test_and_set_bit(0, FLAGS)) > > > [cpu 1] PTR = NULL; > > > [cpu 0] BUG_ON(!ptr) > > > > I don't understand this part... yes, we can hit this BUG_ON() without mb in > > between, this is what I tried to say. > > I may miss something, Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before clear_bit. Since you have mentioned the program order: yes this lacks READ_ONCE() or barrier(), but the same is true for the code in vhost_worker(). So I still don't understand. > but the above is the sequence that is executed > by the processor (for each CPU, it's just the program order). So where > do you expect to place an mb can help? before clear_bit: CPU_0 void *ptr = PTR; mb(); // implies compiler barrier as well clear_bit(0, FLAGS); BUG_ON(!ptr); just in case... 
mb() in the code above is only for illustration, we can use
smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc
the one-way barrier is fine in this case.

> > > In vhost code, there's a condition before the clear_bit() which sits
> > > inside llist_for_each_entry_safe():
> > >
> > > #define llist_for_each_entry_safe(pos, n, node, member)               \
> > >         for (pos = llist_entry((node), typeof(*pos), member);         \
> > >              member_address_is_nonnull(pos, member) &&                \
> > >              (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > >              pos = n)
> > >
> > > The clear_bit() is a store which is not speculated, so there's a
> > > control dependency, the store can't be executed until the condition
> > > expression is evaluated which requires pos->member.next
> > > (work->node.next) to be loaded.
> >
> > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
> > something like
> >
> >	n = llist_entry(...);
> >	if (n)
> >		clear_bit(...);
> >
> > so I do not see how can we rely on the load-store control dependency.
>
> Just to make sure we are on the same page, the condition expression is
>
> member_address_is_nonnull(pos, member) && (n =
> llist_entry(pos->member.next, typeof(*n), member), true)
>
> So it's something like:
>
> if (work->node && (work_next = work->node->next, true))
>         clear_bit(&work->flags);
>
> So two loads from both work->node and work->node->next, and there's a
> store which is clear_bit, then it's a load-store control dependencies?

I guess you missed the comma expression... Let me rewrite your pseudo-code
above, it is equivalent to

	if (work->node) {
		if ((work_next = work->node->next, true))
			clear_bit(&work->flags);
	}

another rewrite:

	if (work->node) {
		work_next = work->node->next;
		if ((work, true))
			clear_bit(&work->flags);
	}

and the final rewrite:

	if (work->node) {
		work_next = work->node->next;
		if (true)
			clear_bit(&work->flags);
	}

so again, I do not see the load-store control dependency.
Not to mention this code lacks READ_ONCE(). If we had something like

	if (work->node) {
		work_next = READ_ONCE(work->node->next);
		if (work_next)
			clear_bit(&work->flags);
	}

instead, then yes, we could rely on the LOAD-STORE dependency.

Oleg.
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-06-01 7:43 ` Oleg Nesterov @ 2023-06-02 5:03 ` Jason Wang 2023-06-02 17:58 ` Oleg Nesterov 0 siblings, 1 reply; 51+ messages in thread From: Jason Wang @ 2023-06-02 5:03 UTC (permalink / raw) To: Oleg Nesterov Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 06/01, Jason Wang wrote: > > > > On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars > > > > > > > > > > void *PTR = something_non_null; > > > > > unsigned long FLAGS = -1ul; > > > > > > > > > > Now I think this code > > > > > > > > > > CPU_0 CPU_1 > > > > > > > > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS)) > > > > > clear_bit(0, FLAGS); PTR = NULL; > > > > > BUG_ON(!ptr); > > > > > > > > > > is racy and can hit the BUG_ON(!ptr). > > > > > > > > This seems different to the above case? > > > > > > not sure, > > > > > > > And you can hit BUG_ON with > > > > the following execution sequence: > > > > > > > > [cpu 0] clear_bit(0, FLAGS); > > > > [cpu 1] if (!test_and_set_bit(0, FLAGS)) > > > > [cpu 1] PTR = NULL; > > > > [cpu 0] BUG_ON(!ptr) > > > > > > I don't understand this part... yes, we can hit this BUG_ON() without mb in > > > between, this is what I tried to say. > > > > I may miss something, > > Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before clear_bit. > Since you have mentioned the program order: yes this lacks READ_ONCE() or barrier(), > but the same is true for the code in vhost_worker(). So I still don't understand. > > > but the above is the sequence that is executed > > by the processor (for each CPU, it's just the program order). So where > > do you expect to place an mb can help? 
> > before clear_bit: > > CPU_0 > > void *ptr = PTR; > mb(); // implies compiler barrier as well > clear_bit(0, FLAGS); > BUG_ON(!ptr); > > just in case... mb() in the code above is only for illustration, we can use > smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the > one-way barrier is fine in this case. Ok, but it seems different, in the case of vhost we had a condition above the clear_bit(). > > > > > > In vhost code, there's a condition before the clear_bit() which sits > > > > inside llist_for_each_entry_safe(): > > > > > > > > #define llist_for_each_entry_safe(pos, n, node, member) \ > > > > for (pos = llist_entry((node), typeof(*pos), member); \ > > > > member_address_is_nonnull(pos, member) && \ > > > > (n = llist_entry(pos->member.next, typeof(*n), member), true); \ > > > > pos = n) > > > > > > > > The clear_bit() is a store which is not speculated, so there's a > > > > control dependency, the store can't be executed until the condition > > > > expression is evaluated which requires pos->member.next > > > > (work->node.next) to be loaded. > > > > > > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have > > > something like > > > > > > n = llist_entry(...); > > > if (n) > > > clear_bit(...); > > > > > > so I do not see how can we rely on the load-store control dependency. > > > > Just to make sure we are on the same page, the condition expression is > > > > member_address_is_nonnull(pos, member) && (n = > > llist_entry(pos->member.next, typeof(*n), member), true) > > > > So it's something like: > > > > if (work->node && (work_next = work->node->next, true)) > > clear_bit(&work->flags); > > > > So two loads from both work->node and work->node->next, and there's a > > store which is clear_bit, then it's a load-store control dependencies? > > I guess you missed the comma expression... 
Probably not, see below: > Let me rewrite your pseudo-code > above, it is equivalent to > > if (work->node) { > if ((work_next = work->node->next, true)) > clear_bit(&work->flags); > } > > another rewrite: > > if (work->node) { > work_next = work->node->next; > if ((work, true)) > clear_bit(&work->flags); > } > > and the final rewrite: > > if (work->node) { > work_next = work->node->next; > if (true) > clear_bit(&work->flags); > } > > so again, I do not see the load-store control dependency. This kind of optimization is suspicious. Especially considering it's the control expression of the loop but not a condition. Looking at the assembly (x86): 0xffffffff81d46c5b <+75>: callq 0xffffffff81689ac0 <llist_reverse_order> 0xffffffff81d46c60 <+80>: mov %rax,%r15 0xffffffff81d46c63 <+83>: test %rax,%rax 0xffffffff81d46c66 <+86>: je 0xffffffff81d46c3a <vhost_worker+42> 0xffffffff81d46c68 <+88>: mov %r15,%rdi 0xffffffff81d46c6b <+91>: mov (%r15),%r15 0xffffffff81d46c6e <+94>: lock andb $0xfd,0x10(%rdi) 0xffffffff81d46c73 <+99>: movl $0x0,0x18(%rbx) 0xffffffff81d46c7a <+106>: mov 0x8(%rdi),%rax 0xffffffff81d46c7e <+110>: callq 0xffffffff821b39a0 <__x86_indirect_thunk_array> 0xffffffff81d46c83 <+115>: callq 0xffffffff821b4d10 <__SCT__cond_resched> ... I can see: 1) The code read node->next (+91) before clear_bit (+94) 2) And the it uses a lock prefix to guarantee the execution order > Not to mention this > code lacks READ_ONCE(). Work_next is loaded only once for temporary storage, so I don't see why we need that. Thanks > > > If we had something like > > if (work->node) { > work_next = READ_ONCE(work->node->next); > if (work_next) > clear_bit(&work->flags); > } > > instead, then yes, we could rely on the LOAD-STORE dependency. > > Oleg. > _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-06-02 5:03 ` Jason Wang @ 2023-06-02 17:58 ` Oleg Nesterov 2023-06-02 20:07 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Oleg Nesterov @ 2023-06-02 17:58 UTC (permalink / raw) To: Jason Wang Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On 06/02, Jason Wang wrote: > > On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > and the final rewrite: > > > > if (work->node) { > > work_next = work->node->next; > > if (true) > > clear_bit(&work->flags); > > } > > > > so again, I do not see the load-store control dependency. > > This kind of optimization is suspicious. Especially considering it's > the control expression of the loop but not a condition. It is not about optimization, > Looking at the assembly (x86): > > 0xffffffff81d46c5b <+75>: callq 0xffffffff81689ac0 <llist_reverse_order> > 0xffffffff81d46c60 <+80>: mov %rax,%r15 > 0xffffffff81d46c63 <+83>: test %rax,%rax > 0xffffffff81d46c66 <+86>: je 0xffffffff81d46c3a <vhost_worker+42> > 0xffffffff81d46c68 <+88>: mov %r15,%rdi > 0xffffffff81d46c6b <+91>: mov (%r15),%r15 > 0xffffffff81d46c6e <+94>: lock andb $0xfd,0x10(%rdi) > 0xffffffff81d46c73 <+99>: movl $0x0,0x18(%rbx) > 0xffffffff81d46c7a <+106>: mov 0x8(%rdi),%rax > 0xffffffff81d46c7e <+110>: callq 0xffffffff821b39a0 > <__x86_indirect_thunk_array> > 0xffffffff81d46c83 <+115>: callq 0xffffffff821b4d10 <__SCT__cond_resched> > ... > > I can see: > > 1) The code read node->next (+91) before clear_bit (+94) The code does. but what about CPU ? > 2) And the it uses a lock prefix to guarantee the execution order As I said from the very beginning, this code is fine on x86 because atomic ops are fully serialised on x86. OK. we can't convince each other. 
I'll try to write another email when I have time. If this code is correct,
then my understanding of memory barriers is even worse than I think. I
wouldn't be surprised, but I'd like to understand what I have missed.

Oleg.
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-06-02 17:58 ` Oleg Nesterov
@ 2023-06-02 20:07 ` Linus Torvalds
  2023-06-05 14:20 ` Oleg Nesterov
  0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2023-06-02 20:07 UTC (permalink / raw)
To: Oleg Nesterov
Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha,
	nicolas.dichtel, virtualization

On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> As I said from the very beginning, this code is fine on x86 because
> atomic ops are fully serialised on x86.

Yes. Other architectures require __smp_mb__{before,after}_atomic for
the bit setting ops to actually be memory barriers.

We *should* probably have acquire/release versions of the bit test/set
helpers, but we don't, so they end up being full memory barriers with
those things. Which isn't optimal, but I doubt it matters on most
architectures.

So maybe we'll some day have a "test_bit_acquire()" and a
"set_bit_release()" etc.

           Linus
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-06-02 20:07 ` Linus Torvalds
@ 2023-06-05 14:20 ` Oleg Nesterov
  0 siblings, 0 replies; 51+ messages in thread
From: Oleg Nesterov @ 2023-06-05 14:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: axboe, brauner, mst, linux, linux-kernel, ebiederm, stefanha,
	nicolas.dichtel, virtualization

On 06/02, Linus Torvalds wrote:
>
> On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > As I said from the very beginning, this code is fine on x86 because
> > atomic ops are fully serialised on x86.
>
> Yes. Other architectures require __smp_mb__{before,after}_atomic for
> the bit setting ops to actually be memory barriers.
>
> We *should* probably have acquire/release versions of the bit test/set
> helpers, but we don't, so they end up being full memory barriers with
> those things. Which isn't optimal, but I doubt it matters on most
> architectures.
>
> So maybe we'll some day have a "test_bit_acquire()" and a
> "set_bit_release()" etc.

In this particular case we need clear_bit_release() and iiuc it is
already here, just it is named clear_bit_unlock().

So do you agree that vhost_worker() needs smp_mb__before_atomic()
before clear_bit() or just clear_bit_unlock() to avoid the race with
vhost_work_queue() ?

Let me provide a simplified example:

	struct item {
		struct llist_node llist;
		unsigned long flags;
	};

	struct llist_head HEAD = {};	// global

	void queue(struct item *item)
	{
		// ensure this item was already flushed
		if (!test_and_set_bit(0, &item->flags))
			llist_add(&item->llist, &HEAD);
	}

	void flush(void)
	{
		struct llist_node *head = llist_del_all(&HEAD);
		struct item *item, *next;

		llist_for_each_entry_safe(item, next, head, llist)
			clear_bit(0, &item->flags);
	}

I think this code is buggy in that flush() can race with queue(), the
same way as vhost_worker() and vhost_work_queue().

Once flush() clears bit 0, queue() can come on another CPU and re-queue
this item and change item->llist.next. We need a barrier before
clear_bit() to ensure that next = llist_entry(item->next) in
llist_for_each_entry_safe() completes before the result of clear_bit()
is visible to queue().

And, I do not think we can rely on control dependency because... because
I fail to see the load-store control dependency in this code,
llist_for_each_entry_safe() loads item->llist.next but doesn't check the
result until the next iteration. No?

Oleg.
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 2:51 ` [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression Mike Christie 2023-05-22 12:30 ` Oleg Nesterov @ 2023-05-22 19:40 ` Michael S. Tsirkin 2023-05-23 15:39 ` Eric W. Biederman 2023-05-23 15:48 ` Mike Christie 1 sibling, 2 replies; 51+ messages in thread From: Michael S. Tsirkin @ 2023-05-22 19:40 UTC (permalink / raw) To: Mike Christie Cc: axboe, brauner, linux-kernel, oleg, linux, ebiederm, stefanha, nicolas.dichtel, virtualization, torvalds On Sun, May 21, 2023 at 09:51:24PM -0500, Mike Christie wrote: > When switching from kthreads to vhost_tasks two bugs were added: > 1. The vhost worker tasks's now show up as processes so scripts doing ps > or ps a would not incorrectly detect the vhost task as another process. > 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's > didn't disable or add support for them. > > To fix both bugs, this switches the vhost task to be thread in the > process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call > get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that > SIGKILL/STOP support is required because CLONE_THREAD requires > CLONE_SIGHAND which requires those 2 signals to be suppported. > > This a modified version of patch originally written by Linus which > handles his review comment to himself to rename ignore_signals to > block_signals to better represent what it now does. And it includes a > change to vhost_worker() to support SIGSTOP/KILL and freeze, and it > drops the wait use per Oleg's review comment that it's no longer needed > when using CLONE_THREAD. 
> > Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") > Signed-off-by: Mike Christie <michael.christie@oracle.com> > --- > drivers/vhost/vhost.c | 17 ++++++++++++++++- > include/linux/sched/task.h | 2 +- > kernel/fork.c | 12 +++--------- > kernel/signal.c | 1 + > kernel/vhost_task.c | 16 ++++------------ > 5 files changed, 25 insertions(+), 23 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index a92af08e7864..bf83e9340e40 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -338,6 +338,7 @@ static int vhost_worker(void *data) > struct vhost_worker *worker = data; > struct vhost_work *work, *work_next; > struct llist_node *node; > + bool dead = false; > > for (;;) { > /* mb paired w/ kthread_stop */ > @@ -349,8 +350,22 @@ static int vhost_worker(void *data) > } > > node = llist_del_all(&worker->work_list); > - if (!node) > + if (!node) { > schedule(); > + /* > + * When we get a SIGKILL our release function will > + * be called. That will stop new IOs from being queued > + * and check for outstanding cmd responses. It will then > + * call vhost_task_stop to tell us to return and exit. > + */ > + if (!dead && signal_pending(current)) { > + struct ksignal ksig; > + > + dead = get_signal(&ksig); > + if (dead) > + clear_thread_flag(TIF_SIGPENDING); Does get_signal actually return true only on SIGKILL then? 
> + } > + } > > node = llist_reverse_order(node); > /* make sure flag is seen after deletion */ > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > index 537cbf9a2ade..249a5ece9def 100644 > --- a/include/linux/sched/task.h > +++ b/include/linux/sched/task.h > @@ -29,7 +29,7 @@ struct kernel_clone_args { > u32 io_thread:1; > u32 user_worker:1; > u32 no_files:1; > - u32 ignore_signals:1; > + u32 block_signals:1; > unsigned long stack; > unsigned long stack_size; > unsigned long tls; > diff --git a/kernel/fork.c b/kernel/fork.c > index ed4e01daccaa..9e04ab5c3946 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process( > p->flags |= PF_KTHREAD; > if (args->user_worker) > p->flags |= PF_USER_WORKER; > - if (args->io_thread) { > - /* > - * Mark us an IO worker, and block any signal that isn't > - * fatal or STOP > - */ > + if (args->io_thread) > p->flags |= PF_IO_WORKER; > + if (args->block_signals) > siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); > - } > > if (args->name) > strscpy_pad(p->comm, args->name, sizeof(p->comm)); > @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process( > if (retval) > goto bad_fork_cleanup_io; > > - if (args->ignore_signals) > - ignore_signals(p); > - > stackleak_task_init(p); > > if (pid != &init_struct_pid) { > @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) > .fn_arg = arg, > .io_thread = 1, > .user_worker = 1, > + .block_signals = 1, > }; > > return copy_process(NULL, 0, node, &args); > diff --git a/kernel/signal.c b/kernel/signal.c > index 8050fe23c732..a0f00a078cbb 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig) > > return ksig->sig > 0; > } > +EXPORT_SYMBOL_GPL(get_signal); If you are exporting this, could you add documentation please? 
> /** > * signal_delivered - called after signal delivery to update blocked signals > diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c > index b7cbd66f889e..7a2d7d9fe772 100644 > --- a/kernel/vhost_task.c > +++ b/kernel/vhost_task.c > @@ -31,22 +31,13 @@ static int vhost_task_fn(void *data) > */ > void vhost_task_stop(struct vhost_task *vtsk) > { > - pid_t pid = vtsk->task->pid; > - > set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags); > wake_up_process(vtsk->task); > /* > * Make sure vhost_task_fn is no longer accessing the vhost_task before > - * freeing it below. If userspace crashed or exited without closing, > - * then the vhost_task->task could already be marked dead so > - * kernel_wait will return early. > + * freeing it below. > */ > wait_for_completion(&vtsk->exited); > - /* > - * If we are just closing/removing a device and the parent process is > - * not exiting then reap the task. > - */ > - kernel_wait4(pid, NULL, __WCLONE, NULL); > kfree(vtsk); > } > EXPORT_SYMBOL_GPL(vhost_task_stop); > @@ -75,13 +66,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, > const char *name) > { > struct kernel_clone_args args = { > - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, > + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | > + CLONE_THREAD | CLONE_SIGHAND, > .exit_signal = 0, > .fn = vhost_task_fn, > .name = name, > .user_worker = 1, > .no_files = 1, > - .ignore_signals = 1, > + .block_signals = 1, > }; > struct vhost_task *vtsk; > struct task_struct *tsk; > -- > 2.25.1 _______________________________________________ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression 2023-05-22 19:40 ` Michael S. Tsirkin @ 2023-05-23 15:39 ` Eric W. Biederman 2023-05-23 15:48 ` Mike Christie 1 sibling, 0 replies; 51+ messages in thread From: Eric W. Biederman @ 2023-05-23 15:39 UTC (permalink / raw) To: Michael S. Tsirkin Cc: axboe, brauner, linux-kernel, oleg, linux, stefanha, nicolas.dichtel, virtualization, torvalds "Michael S. Tsirkin" <mst@redhat.com> writes: > On Sun, May 21, 2023 at 09:51:24PM -0500, Mike Christie wrote: >> When switching from kthreads to vhost_tasks two bugs were added: >> 1. The vhost worker tasks's now show up as processes so scripts doing ps >> or ps a would not incorrectly detect the vhost task as another process. >> 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's >> didn't disable or add support for them. >> >> To fix both bugs, this switches the vhost task to be thread in the >> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call >> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that >> SIGKILL/STOP support is required because CLONE_THREAD requires >> CLONE_SIGHAND which requires those 2 signals to be suppported. >> >> This a modified version of patch originally written by Linus which >> handles his review comment to himself to rename ignore_signals to >> block_signals to better represent what it now does. And it includes a >> change to vhost_worker() to support SIGSTOP/KILL and freeze, and it >> drops the wait use per Oleg's review comment that it's no longer needed >> when using CLONE_THREAD. 
>> >> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") >> Signed-off-by: Mike Christie <michael.christie@oracle.com> >> --- >> drivers/vhost/vhost.c | 17 ++++++++++++++++- >> include/linux/sched/task.h | 2 +- >> kernel/fork.c | 12 +++--------- >> kernel/signal.c | 1 + >> kernel/vhost_task.c | 16 ++++------------ >> 5 files changed, 25 insertions(+), 23 deletions(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index a92af08e7864..bf83e9340e40 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -338,6 +338,7 @@ static int vhost_worker(void *data) >> struct vhost_worker *worker = data; >> struct vhost_work *work, *work_next; >> struct llist_node *node; >> + bool dead = false; >> >> for (;;) { >> /* mb paired w/ kthread_stop */ >> @@ -349,8 +350,22 @@ static int vhost_worker(void *data) >> } >> >> node = llist_del_all(&worker->work_list); >> - if (!node) >> + if (!node) { >> schedule(); >> + /* >> + * When we get a SIGKILL our release function will >> + * be called. That will stop new IOs from being queued >> + * and check for outstanding cmd responses. It will then >> + * call vhost_task_stop to tell us to return and exit. >> + */ >> + if (!dead && signal_pending(current)) { >> + struct ksignal ksig; >> + >> + dead = get_signal(&ksig); >> + if (dead) >> + clear_thread_flag(TIF_SIGPENDING); > > > Does get_signal actually return true only on SIGKILL then? get_signal returns the signal that was dequeued, or 0 if no signal was dequeued. For these threads that block all but SIGSTOP and SIGKILL get_signal should properly never return any signal. As SIGSTOP and SIGKILL are handled internally to get_signal. However get_signal was modified to have a special case for io_uring and now the vhost code so that extra cleanup can be performed, before do_exit is called on the thread. That special case causes get_signal to return when with the value of SIGKILL when the process exits. 
The process can exit with do_group_exit() aka exit(2) or any fatal
signal, not just SIGKILL, and get_signal() on these threads will return.

Eric
* Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
  2023-05-22 19:40 ` Michael S. Tsirkin
  2023-05-23 15:39 ` Eric W. Biederman
@ 2023-05-23 15:48 ` Mike Christie
  1 sibling, 0 replies; 51+ messages in thread
From: Mike Christie @ 2023-05-23 15:48 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: axboe, brauner, linux-kernel, oleg, linux, ebiederm, stefanha,
    nicolas.dichtel, virtualization, torvalds

On 5/22/23 2:40 PM, Michael S. Tsirkin wrote:
>> 	return copy_process(NULL, 0, node, &args);
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8050fe23c732..a0f00a078cbb 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>>
>>  	return ksig->sig > 0;
>>  }
>> +EXPORT_SYMBOL_GPL(get_signal);
>
> If you are exporting this, could you add documentation please?
>

Ok.
end of thread, other threads:[~2023-06-05 14:21 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-22 2:51 [PATCH 0/3] vhost: Fix freezer/ps regressions Mike Christie
2023-05-22 2:51 ` [PATCH 1/3] signal: Don't always put SIGKILL in shared_pending Mike Christie
2023-05-23 15:30 ` Eric W. Biederman
2023-05-22 2:51 ` [PATCH 2/3] signal: Don't exit for PF_USER_WORKER tasks Mike Christie
2023-05-22 2:51 ` [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression Mike Christie
2023-05-22 12:30 ` Oleg Nesterov
2023-05-22 17:00 ` Mike Christie
2023-05-22 17:47 ` Oleg Nesterov
2023-05-23 12:15 ` Oleg Nesterov
2023-05-23 15:57 ` Eric W. Biederman
2023-05-24 14:10 ` Oleg Nesterov
2023-05-24 14:44 ` Eric W. Biederman
2023-05-25 11:55 ` Oleg Nesterov
2023-05-25 15:30 ` Eric W. Biederman
2023-05-25 16:20 ` Linus Torvalds
2023-05-27 9:49 ` Eric W. Biederman
2023-05-27 16:12 ` Linus Torvalds
2023-05-28 1:17 ` Eric W. Biederman
2023-05-28 1:21 ` Linus Torvalds
2023-05-29 11:19 ` Oleg Nesterov
2023-05-29 16:09 ` michael.christie
2023-05-29 17:46 ` Oleg Nesterov
2023-05-29 17:54 ` Oleg Nesterov
2023-05-29 19:03 ` Mike Christie
2023-05-29 19:35 ` Mike Christie
2023-05-29 19:46 ` michael.christie
2023-05-30 2:48 ` Eric W. Biederman
2023-05-30 2:38 ` Eric W. Biederman
2023-05-30 15:34 ` Mike Christie
2023-05-31 3:30 ` Mike Christie
2023-05-29 16:11 ` michael.christie
[not found] ` <20230530-autor-faxnummer-01e0a31c0fb8@brauner>
2023-05-30 17:55 ` Oleg Nesterov
2023-05-30 15:01 ` Eric W. Biederman
2023-05-31 5:22 ` Jason Wang
2023-05-24 0:02 ` Mike Christie
2023-05-25 16:15 ` Mike Christie
2023-05-28 1:41 ` Eric W. Biederman
2023-05-28 19:29 ` Mike Christie
2023-05-31 5:22 ` Jason Wang
2023-05-31 7:25 ` Oleg Nesterov
2023-05-31 8:17 ` Jason Wang
2023-05-31 9:14 ` Oleg Nesterov
2023-06-01 2:44 ` Jason Wang
2023-06-01 7:43 ` Oleg Nesterov
2023-06-02 5:03 ` Jason Wang
2023-06-02 17:58 ` Oleg Nesterov
2023-06-02 20:07 ` Linus Torvalds
2023-06-05 14:20 ` Oleg Nesterov
2023-05-22 19:40 ` Michael S. Tsirkin
2023-05-23 15:39 ` Eric W. Biederman
2023-05-23 15:48 ` Mike Christie