* [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
@ 2023-05-18 0:09 Mike Christie
2023-05-18 0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
` (7 more replies)
0 siblings, 8 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This patchset allows the vhost and vhost_task code to use CLONE_THREAD,
CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
normal testing, haven't converted vsock and vdpa, and I know you guys
will not like the first patch. However, I think it better shows what
we need from the signal code and how we can support signals in the
vhost_task layer.
Note that I took the super simple route and kicked off some work to
the system workqueue. We can do more invasive approaches:
1. Modify the vhost drivers so they can check for IO completions using
a non-blocking interface. We then don't need to run from the system
workqueue and can run from the vhost_task.
2. We could drop patch 1 and just say we are doing a polling type
of approach. We then modify the vhost layer similar to #1 where we
can check for completions using a non-blocking interface and use
the vhost_task itself (a rough sketch of this is below).
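For illustration, here is a rough sketch of what that non-blocking
approach could look like. vhost_worker_run_work(),
vhost_dev_poll_completions() and vhost_dev_has_inflight() are made-up
helper names (and worker->dev is only added later in this series), so
this only shows the shape of the loop:

static int vhost_worker_poll_style(void *data)
{
	struct vhost_worker *worker = data;

	for (;;) {
		/* Run whatever work has been queued to us. */
		vhost_worker_run_work(worker);

		/*
		 * Hypothetical non-blocking check for completions from
		 * the lower layers (block, net) instead of sleeping in
		 * schedule() and being woken by the completion path.
		 */
		vhost_dev_poll_completions(worker->dev);

		if (vhost_task_should_stop(worker->vtsk) &&
		    !vhost_dev_has_inflight(worker->dev))
			break;

		cond_resched();
	}
	return 0;
}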
* [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 2:34 ` Eric W. Biederman
` (2 more replies)
2023-05-18 0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
` (6 subsequent siblings)
7 siblings, 3 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
set when we are dealing with PF_USER_WORKER tasks.
When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
We can easily stop new work/IO from being queued to the vhost_task, but
for IO that's already been sent to something like the block layer we
need to wait for the response then process it. These types of IO
completions use the vhost_task to process the completion so we can't
exit immediately.
We need to wait for and then handle those completions from the
vhost_task, but when we have a SIGKILL pending, functions like
schedule() return immediately so we can't wait like normal. Functions
like vhost_worker() degrade to just a while(1); loop.
This patch has get_signal drop down to the normal code path when
SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
there is a SIGKILL but still perform some blocking cleanup.
Note that the chunk I'm now bypassing does:
sigdelset(&current->pending.signal, SIGKILL);
We look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
group_exec_task we are already doing that on the threads in the
group.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
kernel/signal.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..ae4972eea5db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
struct k_sigaction *ka;
enum pid_type type;
- /* Has this task already been marked for death? */
- if ((signal->flags & SIGNAL_GROUP_EXIT) ||
- signal->group_exec_task) {
+ /*
+ * Has this task already been marked for death?
+ *
+ * If this is a PF_USER_WORKER then the task may need to do
+ * extra work that requires waiting on running work, so we want
+ * to dequeue the signal below and tell the caller it's time to
+ * start its exit procedure. When the work has completed then
+ * the task will exit.
+ */
+ if (!(current->flags & PF_USER_WORKER) &&
+ ((signal->flags & SIGNAL_GROUP_EXIT) ||
+ signal->group_exec_task)) {
clear_siginfo(&ksig->info);
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
@@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
}
/*
- * PF_IO_WORKER threads will catch and exit on fatal signals
+ * PF_USER_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
* we cannot call do_exit() on their behalf.
*/
- if (current->flags & PF_IO_WORKER)
+ if (current->flags & PF_USER_WORKER)
goto out;
/*
--
2.25.1
* [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
2023-05-18 0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 0:16 ` Linus Torvalds
2023-05-18 0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
` (5 subsequent siblings)
7 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This patch has vhost use get_signal to handle freezing and sort of
handle signals. By the latter I mean that when we get SIGKILL, our
parent will exit and call our file_operations release function. That will
then stop new work from being queued and wait for the vhost_task to
handle completions for running IO. We then exit when those are done.
The next patches will then have us work more like io_uring where
we handle the get_signal return value and key off that to clean up.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
drivers/vhost/vhost.c | 10 +++++++++-
include/linux/sched/vhost_task.h | 1 +
kernel/vhost_task.c | 20 ++++++++++++++++++++
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
}
node = llist_del_all(&worker->work_list);
- if (!node)
+ if (!node) {
schedule();
+ /*
+ * When we get a SIGKILL our release function will
+ * be called. That will stop new IOs from being queued
+ * and check for outstanding cmd responses. It will then
+ * call vhost_task_stop to exit us.
+ */
+ vhost_task_get_signal();
+ }
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..54b68115eb3b 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
void vhost_task_start(struct vhost_task *vtsk);
void vhost_task_stop(struct vhost_task *vtsk);
bool vhost_task_should_stop(struct vhost_task *vtsk);
+bool vhost_task_get_signal(void);
#endif
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..a661cfa32ba3 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
}
EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * Return true if we got SIGKILL.
+ */
+bool vhost_task_get_signal(void)
+{
+ struct ksignal ksig;
+ bool rc;
+
+ if (!signal_pending(current))
+ return false;
+
+ __set_current_state(TASK_RUNNING);
+ rc = get_signal(&ksig);
+ set_current_state(TASK_INTERRUPTIBLE);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
/**
* vhost_task_create - create a copy of a process to be used by the kernel
* @fn: thread stack
--
2.25.1
* [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
2023-05-18 0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
2023-05-18 0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 0:09 ` [RFC PATCH 4/8] vhost-net: Move vhost_net_open Mike Christie
` (4 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This is a modified version of Linus's patch which has vhost_task
use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.
I renamed ignore_signals to block_signals based on Linus's comment,
since it aligns with what we are doing with the siginitsetinv
p->blocked use and with no longer calling ignore_signals.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
include/linux/sched/task.h | 2 +-
kernel/fork.c | 12 +++---------
kernel/vhost_task.c | 5 +++--
3 files changed, 7 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
+ u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
- if (args->io_thread) {
- /*
- * Mark us an IO worker, and block any signal that isn't
- * fatal or STOP
- */
+ if (args->io_thread)
p->flags |= PF_IO_WORKER;
+ if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
- }
if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
- if (args->ignore_signals)
- ignore_signals(p);
-
stackleak_task_init(p);
if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn_arg = arg,
.io_thread = 1,
.user_worker = 1,
+ .block_signals = 1,
};
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a661cfa32ba3..a11f036290cc 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
+ .block_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
--
2.25.1
* [RFC PATCH 4/8] vhost-net: Move vhost_net_open
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
` (2 preceding siblings ...)
2023-05-18 0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
` (3 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This moves vhost_net_open so in the next patches we can pass
vhost_dev_init a new helper which will use the stop/flush functions.
There are no functionality changes in this patch.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
drivers/vhost/net.c | 134 ++++++++++++++++++++++----------------------
1 file changed, 67 insertions(+), 67 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 07181cd8d52e..8557072ff05e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1285,73 +1285,6 @@ static void handle_rx_net(struct vhost_work *work)
handle_rx(net);
}
-static int vhost_net_open(struct inode *inode, struct file *f)
-{
- struct vhost_net *n;
- struct vhost_dev *dev;
- struct vhost_virtqueue **vqs;
- void **queue;
- struct xdp_buff *xdp;
- int i;
-
- n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
- if (!n)
- return -ENOMEM;
- vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
- if (!vqs) {
- kvfree(n);
- return -ENOMEM;
- }
-
- queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
- GFP_KERNEL);
- if (!queue) {
- kfree(vqs);
- kvfree(n);
- return -ENOMEM;
- }
- n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
-
- xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
- if (!xdp) {
- kfree(vqs);
- kvfree(n);
- kfree(queue);
- return -ENOMEM;
- }
- n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
-
- dev = &n->dev;
- vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
- vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
- n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
- n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
- for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
- n->vqs[i].ubufs = NULL;
- n->vqs[i].ubuf_info = NULL;
- n->vqs[i].upend_idx = 0;
- n->vqs[i].done_idx = 0;
- n->vqs[i].batched_xdp = 0;
- n->vqs[i].vhost_hlen = 0;
- n->vqs[i].sock_hlen = 0;
- n->vqs[i].rx_ring = NULL;
- vhost_net_buf_init(&n->vqs[i].rxq);
- }
- vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
- UIO_MAXIOV + VHOST_NET_BATCH,
- VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
- NULL);
-
- vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
- vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-
- f->private_data = n;
- n->page_frag.page = NULL;
- n->refcnt_bias = 0;
-
- return 0;
-}
-
static struct socket *vhost_net_stop_vq(struct vhost_net *n,
struct vhost_virtqueue *vq)
{
@@ -1421,6 +1354,73 @@ static int vhost_net_release(struct inode *inode, struct file *f)
return 0;
}
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+ struct vhost_net *n;
+ struct vhost_dev *dev;
+ struct vhost_virtqueue **vqs;
+ void **queue;
+ struct xdp_buff *xdp;
+ int i;
+
+ n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+ if (!n)
+ return -ENOMEM;
+ vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
+ if (!vqs) {
+ kvfree(n);
+ return -ENOMEM;
+ }
+
+ queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
+ GFP_KERNEL);
+ if (!queue) {
+ kfree(vqs);
+ kvfree(n);
+ return -ENOMEM;
+ }
+ n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
+
+ xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
+ if (!xdp) {
+ kfree(vqs);
+ kvfree(n);
+ kfree(queue);
+ return -ENOMEM;
+ }
+ n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
+
+ dev = &n->dev;
+ vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+ vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+ n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+ n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+ for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
+ n->vqs[i].ubufs = NULL;
+ n->vqs[i].ubuf_info = NULL;
+ n->vqs[i].upend_idx = 0;
+ n->vqs[i].done_idx = 0;
+ n->vqs[i].batched_xdp = 0;
+ n->vqs[i].vhost_hlen = 0;
+ n->vqs[i].sock_hlen = 0;
+ n->vqs[i].rx_ring = NULL;
+ vhost_net_buf_init(&n->vqs[i].rxq);
+ }
+ vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
+ UIO_MAXIOV + VHOST_NET_BATCH,
+ VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
+ NULL);
+
+ vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
+ vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+
+ f->private_data = n;
+ n->page_frag.page = NULL;
+ n->refcnt_bias = 0;
+
+ return 0;
+}
+
static struct socket *get_raw_socket(int fd)
{
int r;
--
2.25.1
* [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
` (3 preceding siblings ...)
2023-05-18 0:09 ` [RFC PATCH 4/8] vhost-net: Move vhost_net_open Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
[not found] ` <20230518-lokomotive-aufziehen-dbc432136b76@brauner>
2023-05-18 0:09 ` [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works Mike Christie
` (2 subsequent siblings)
7 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
When the vhost_task gets a SIGKILL we want to stop new work from being
queued and also wait for and handle completions for running work. For the
latter, we still need to use the vhost_task to handle the completing work
so we can't just exit right away. Instead, this has us kick off the
stopping and flushing of the device/vhost_task/worker to the system
workqueue while the vhost_task handles completions. When all completions
are done we will then call vhost_task_stop and exit.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
drivers/vhost/net.c | 2 +-
drivers/vhost/scsi.c | 4 ++--
drivers/vhost/test.c | 3 ++-
drivers/vhost/vdpa.c | 2 +-
drivers/vhost/vhost.c | 48 ++++++++++++++++++++++++++++++++++++-------
drivers/vhost/vhost.h | 10 ++++++++-
drivers/vhost/vsock.c | 4 ++--
7 files changed, 58 insertions(+), 15 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8557072ff05e..90c25127b3f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
UIO_MAXIOV + VHOST_NET_BATCH,
VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
- NULL);
+ NULL, NULL);
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..40f9135e1a62 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
vqs[i] = &vs->vqs[i].vq;
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
- vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV,
- VHOST_SCSI_WEIGHT, 0, true, NULL);
+ vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
+ true, NULL, NULL);
vhost_scsi_init_inflight(vs, NULL);
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 42c955a5b211..11a2823d7532 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -120,7 +120,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
- VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
+ VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL,
+ NULL);
f->private_data = n;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 8c1aefc865f0..de9a83ecb70d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
vqs[i]->handle_kick = handle_vq_kick;
}
vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
- vhost_vdpa_process_iotlb_msg);
+ vhost_vdpa_process_iotlb_msg, NULL);
r = vhost_vdpa_alloc_domain(v);
if (r)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1ba9e068b2ab..4163c86db50c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
static int vhost_worker(void *data)
{
struct vhost_worker *worker = data;
+ struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
@@ -352,12 +353,13 @@ static int vhost_worker(void *data)
if (!node) {
schedule();
/*
- * When we get a SIGKILL our release function will
- * be called. That will stop new IOs from being queued
- * and check for outstanding cmd responses. It will then
- * call vhost_task_stop to exit us.
+ * When we get a SIGKILL we kick off a work to
+ * run the driver's helper to stop new work and
+ * handle completions. When they are done they will
+ * call vhost_task_stop to tell us to exit.
*/
- vhost_task_get_signal();
+ if (vhost_task_get_signal())
+ schedule_work(&dev->destroy_worker);
}
node = llist_reverse_order(node);
@@ -376,6 +378,33 @@ static int vhost_worker(void *data)
return 0;
}
+static void __vhost_dev_stop_work(struct vhost_dev *dev)
+{
+ mutex_lock(&dev->stop_work_mutex);
+ if (dev->work_stopped)
+ goto done;
+
+ if (dev->stop_dev_work)
+ dev->stop_dev_work(dev);
+ dev->work_stopped = true;
+done:
+ mutex_unlock(&dev->stop_work_mutex);
+}
+
+void vhost_dev_stop_work(struct vhost_dev *dev)
+{
+ __vhost_dev_stop_work(dev);
+ flush_work(&dev->destroy_worker);
+}
+EXPORT_SYMBOL_GPL(vhost_dev_stop_work);
+
+static void vhost_worker_destroy(struct work_struct *work)
+{
+ struct vhost_dev *dev = container_of(work, struct vhost_dev,
+ destroy_worker);
+ __vhost_dev_stop_work(dev);
+}
+
static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
{
kfree(vq->indirect);
@@ -464,7 +493,8 @@ void vhost_dev_init(struct vhost_dev *dev,
int iov_limit, int weight, int byte_weight,
bool use_worker,
int (*msg_handler)(struct vhost_dev *dev, u32 asid,
- struct vhost_iotlb_msg *msg))
+ struct vhost_iotlb_msg *msg),
+ void (*stop_dev_work)(struct vhost_dev *dev))
{
struct vhost_virtqueue *vq;
int i;
@@ -472,6 +502,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->vqs = vqs;
dev->nvqs = nvqs;
mutex_init(&dev->mutex);
+ mutex_init(&dev->stop_work_mutex);
dev->log_ctx = NULL;
dev->umem = NULL;
dev->iotlb = NULL;
@@ -482,12 +513,14 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->byte_weight = byte_weight;
dev->use_worker = use_worker;
dev->msg_handler = msg_handler;
+ dev->work_stopped = false;
+ dev->stop_dev_work = stop_dev_work;
+ INIT_WORK(&dev->destroy_worker, vhost_worker_destroy);
init_waitqueue_head(&dev->wait);
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
spin_lock_init(&dev->iotlb_lock);
-
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
vq->log = NULL;
@@ -572,6 +605,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
if (!worker)
return -ENOMEM;
+ worker->dev = dev;
dev->worker = worker;
worker->kcov_handle = kcov_common_handle();
init_llist_head(&worker->work_list);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0308638cdeee..325e5e52c7ae 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -17,6 +17,7 @@
struct vhost_work;
struct vhost_task;
+struct vhost_dev;
typedef void (*vhost_work_fn_t)(struct vhost_work *work);
#define VHOST_WORK_QUEUED 1
@@ -28,6 +29,7 @@ struct vhost_work {
struct vhost_worker {
struct vhost_task *vtsk;
+ struct vhost_dev *dev;
struct llist_head work_list;
u64 kcov_handle;
};
@@ -165,8 +167,12 @@ struct vhost_dev {
int weight;
int byte_weight;
bool use_worker;
+ struct mutex stop_work_mutex;
+ bool work_stopped;
+ struct work_struct destroy_worker;
int (*msg_handler)(struct vhost_dev *dev, u32 asid,
struct vhost_iotlb_msg *msg);
+ void (*stop_dev_work)(struct vhost_dev *dev);
};
bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
@@ -174,7 +180,8 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
int nvqs, int iov_limit, int weight, int byte_weight,
bool use_worker,
int (*msg_handler)(struct vhost_dev *dev, u32 asid,
- struct vhost_iotlb_msg *msg));
+ struct vhost_iotlb_msg *msg),
+ void (*stop_dev_work)(struct vhost_dev *dev));
long vhost_dev_set_owner(struct vhost_dev *dev);
bool vhost_dev_has_owner(struct vhost_dev *dev);
long vhost_dev_check_owner(struct vhost_dev *);
@@ -182,6 +189,7 @@ struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *iotlb);
void vhost_dev_cleanup(struct vhost_dev *);
void vhost_dev_stop(struct vhost_dev *);
+void vhost_dev_stop_work(struct vhost_dev *dev);
long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..1ef53722d494 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -664,8 +664,8 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs),
- UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT,
- VHOST_VSOCK_WEIGHT, true, NULL);
+ UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT, VHOST_VSOCK_WEIGHT,
+ true, NULL, NULL);
file->private_data = vsock;
skb_queue_head_init(&vsock->send_pkt_queue);
--
2.25.1
* [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
` (4 preceding siblings ...)
2023-05-18 0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 0:09 ` [RFC PATCH 7/8] vhost-net: " Mike Christie
2023-05-18 0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
7 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This moves the SCSI code we use to stop new work from being queued
and to wait on running work into a helper which is used by the vhost
layer when the vhost_task is being killed by a SIGKILL.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
drivers/vhost/scsi.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 40f9135e1a62..a0f2588270f2 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
return 0;
}
+static void vhost_scsi_stop_dev_work(struct vhost_dev *dev)
+{
+ struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev);
+ struct vhost_scsi_target t;
+
+ mutex_lock(&vs->dev.mutex);
+ memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
+ mutex_unlock(&vs->dev.mutex);
+ vhost_scsi_clear_endpoint(vs, &t);
+ vhost_dev_stop(&vs->dev);
+ vhost_dev_cleanup(&vs->dev);
+}
+
static int vhost_scsi_open(struct inode *inode, struct file *f)
{
struct vhost_scsi *vs;
@@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
- true, NULL, NULL);
+ true, NULL, vhost_scsi_stop_dev_work);
vhost_scsi_init_inflight(vs, NULL);
@@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
static int vhost_scsi_release(struct inode *inode, struct file *f)
{
struct vhost_scsi *vs = f->private_data;
- struct vhost_scsi_target t;
- mutex_lock(&vs->dev.mutex);
- memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
- mutex_unlock(&vs->dev.mutex);
- vhost_scsi_clear_endpoint(vs, &t);
- vhost_dev_stop(&vs->dev);
- vhost_dev_cleanup(&vs->dev);
+ vhost_dev_stop_work(&vs->dev);
kfree(vs->dev.vqs);
kfree(vs->vqs);
kfree(vs->old_inflight);
--
2.25.1
* [RFC PATCH 7/8] vhost-net: Add callback to stop and wait on works
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
` (5 preceding siblings ...)
2023-05-18 0:09 ` [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
7 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
This moves the net code we use to stop new work from being queued
and to wait on running work into a helper which is used by the vhost
layer when the vhost_task is being killed by a SIGKILL.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
drivers/vhost/net.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 90c25127b3f8..f8a5527b15ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1325,9 +1325,9 @@ static void vhost_net_flush(struct vhost_net *n)
}
}
-static int vhost_net_release(struct inode *inode, struct file *f)
+static void vhost_net_stop_dev_work(struct vhost_dev *dev)
{
- struct vhost_net *n = f->private_data;
+ struct vhost_net *n = container_of(dev, struct vhost_net, dev);
struct socket *tx_sock;
struct socket *rx_sock;
@@ -1345,6 +1345,13 @@ static int vhost_net_release(struct inode *inode, struct file *f)
/* We do an extra flush before freeing memory,
* since jobs can re-queue themselves. */
vhost_net_flush(n);
+}
+
+static int vhost_net_release(struct inode *inode, struct file *f)
+{
+ struct vhost_net *n = f->private_data;
+
+ vhost_dev_stop_work(&n->dev);
kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
kfree(n->dev.vqs);
@@ -1409,7 +1416,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
UIO_MAXIOV + VHOST_NET_BATCH,
VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
- NULL, NULL);
+ NULL, vhost_net_stop_dev_work);
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
--
2.25.1
* [RFC PATCH 8/8] fork/vhost_task: remove no_files
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
` (6 preceding siblings ...)
2023-05-18 0:09 ` [RFC PATCH 7/8] vhost-net: " Mike Christie
@ 2023-05-18 0:09 ` Mike Christie
2023-05-18 1:04 ` Mike Christie
7 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 0:09 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
The vhost_task can now support the worker being freed from under the
device when we get a SIGKILL or the process exits without closing
devices. We no longer need no_files so this removes it.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
include/linux/sched/task.h | 1 -
kernel/fork.c | 10 ++--------
kernel/vhost_task.c | 3 +--
3 files changed, 3 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 249a5ece9def..342fe297ffd4 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -28,7 +28,6 @@ struct kernel_clone_args {
u32 kthread:1;
u32 io_thread:1;
u32 user_worker:1;
- u32 no_files:1;
u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e04ab5c3946..f2c081c15efb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
- int no_files)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
struct files_struct *oldf, *newf;
int error = 0;
@@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
if (!oldf)
goto out;
- if (no_files) {
- tsk->files = NULL;
- goto out;
- }
-
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
- retval = copy_files(clone_flags, p, args->no_files);
+ retval = copy_files(clone_flags, p);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a11f036290cc..642047765190 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
{
struct kernel_clone_args args = {
.flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
- CLONE_THREAD | CLONE_SIGHAND,
+ CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
- .no_files = 1,
.block_signals = 1,
};
struct vhost_task *vtsk;
--
2.25.1
* Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
2023-05-18 0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
@ 2023-05-18 0:16 ` Linus Torvalds
2023-05-18 1:01 ` Mike Christie
0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-05-18 0:16 UTC
To: Mike Christie
Cc: axboe, brauner, mst, linux-kernel, oleg, ebiederm, stefanha,
linux, nicolas.dichtel, virtualization
On Wed, May 17, 2023 at 5:09 PM Mike Christie
<michael.christie@oracle.com> wrote:
>
> + __set_current_state(TASK_RUNNING);
> + rc = get_signal(&ksig);
> + set_current_state(TASK_INTERRUPTIBLE);
> + return rc;
The games with current_state seem nonsensical.
What are they all about? get_signal() shouldn't care, and no other
caller does this thing. This just seems completely random.
Linus
* Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
2023-05-18 0:16 ` Linus Torvalds
@ 2023-05-18 1:01 ` Mike Christie
0 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 1:01 UTC
To: Linus Torvalds
Cc: axboe, brauner, mst, linux-kernel, oleg, ebiederm, stefanha,
linux, nicolas.dichtel, virtualization
On 5/17/23 7:16 PM, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 5:09 PM Mike Christie
> <michael.christie@oracle.com> wrote:
>>
>> + __set_current_state(TASK_RUNNING);
>> + rc = get_signal(&ksig);
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + return rc;
>
> The games with current_state seem nonsensical.
>
> What are they all about? get_signal() shouldn't care, and no other
> caller does this thing. This just seems completely random.
Sorry. It's a leftover.
I was originally calling this from vhost_task_should_stop where before
calling that function we do a:
set_current_state(TASK_INTERRUPTIBLE);
So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
and was getting the "do not call blocking ops when !TASK_RUNNING"
warnings.
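For context, the loop in vhost_worker() is roughly:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (vhost_task_should_stop(worker->vtsk))
			break;

		node = llist_del_all(&worker->work_list);
		if (!node)
			schedule();
		...
	}

so anything called between the set_current_state() and the schedule()
runs with the task !TASK_RUNNING.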
* Re: [RFC PATCH 8/8] fork/vhost_task: remove no_files
2023-05-18 0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
@ 2023-05-18 1:04 ` Mike Christie
0 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-18 1:04 UTC
To: oleg, linux, nicolas.dichtel, axboe, ebiederm, torvalds,
linux-kernel, virtualization, mst, sgarzare, jasowang, stefanha,
brauner
On 5/17/23 7:09 PM, Mike Christie wrote:
> + CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
Sorry. I tried to throw this one in at the last second so we could
see that we can now use CLONE_FILES like io_uring.
It will of course not compile.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
@ 2023-05-18 2:34 ` Eric W. Biederman
2023-05-18 3:49 ` Eric W. Biederman
[not found] ` <20230518-kontakt-geduckt-25bab595f503@brauner>
2 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2023-05-18 2:34 UTC
To: Mike Christie
Cc: axboe, brauner, mst, linux-kernel, oleg, stefanha, linux,
nicolas.dichtel, virtualization, torvalds
Mike Christie <michael.christie@oracle.com> writes:
> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.
> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.
I understand the concern.
> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
>
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
>
> Note that the chunk I'm now bypassing does:
>
> sigdelset(&current->pending.signal, SIGKILL);
>
> We look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.
What you are doing does not make any sense to me.
First there is the semantic nonsense of queuing something that
is not a signal. The per task SIGKILL bit is used as a flag with
essentially the same meaning as SIGNAL_GROUP_EXIT, reporting that
the task has been scheduled for exit.
More so is what happens afterwards.
As I read your patch it is roughly equivalent to doing:
if ((current->flags & PF_USER_WORKER) &&
fatal_signal_pending(current)) {
sigdelset(&current->pending.signal, SIGKILL);
clear_siginfo(&ksig->info);
ksig->info.si_signo = SIGKILL;
ksig->info.si_code = SI_USER;
recalc_sigpending();
trace_signal_deliver(SIGKILL, &ksig->info,
&sighand->action[SIGKILL - 1]);
goto fatal;
}
Before the "(SIGNAL_GROUP_EXIT || signal->group_exec_task)" test.
To get that code I stripped the active statements out of the
dequeue_signal path the code executes after your change below.
I don't get why you are making it though because the code you
are opting out of does:
/* Has this task already been marked for death? */
if ((signal->flags & SIGNAL_GROUP_EXIT) ||
signal->group_exec_task) {
clear_siginfo(&ksig->info);
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();
goto fatal;
}
I don't see what in practice changes, other than the fact that by going
through the ordinary dequeue_signal path other signals can be
processed after a SIGKILL has arrived. Of course those signals should
all be blocked.
The trailing bit that expands the PF_IO_WORKER test to be PF_USER_WORKER
appears reasonable, and possibly needed.
Eric
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
> kernel/signal.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..ae4972eea5db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
> struct k_sigaction *ka;
> enum pid_type type;
>
> - /* Has this task already been marked for death? */
> - if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> - signal->group_exec_task) {
> + /*
> + * Has this task already been marked for death?
> + *
> + * If this is a PF_USER_WORKER then the task may need to do
> + * extra work that requires waiting on running work, so we want
> + * to dequeue the signal below and tell the caller it's time to
> + * start its exit procedure. When the work has completed then
> + * the task will exit.
> + */
> + if (!(current->flags & PF_USER_WORKER) &&
> + ((signal->flags & SIGNAL_GROUP_EXIT) ||
> + signal->group_exec_task)) {
> clear_siginfo(&ksig->info);
> ksig->info.si_signo = signr = SIGKILL;
> sigdelset(&current->pending.signal, SIGKILL);
> @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
> }
>
> /*
> - * PF_IO_WORKER threads will catch and exit on fatal signals
> + * PF_USER_WORKER threads will catch and exit on fatal signals
> * themselves. They have cleanup that must be performed, so
> * we cannot call do_exit() on their behalf.
> */
> - if (current->flags & PF_IO_WORKER)
> + if (current->flags & PF_USER_WORKER)
> goto out;
>
> /*
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
2023-05-18 2:34 ` Eric W. Biederman
@ 2023-05-18 3:49 ` Eric W. Biederman
2023-05-18 15:21 ` Mike Christie
[not found] ` <20230518-kontakt-geduckt-25bab595f503@brauner>
2 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2023-05-18 3:49 UTC
To: Mike Christie
Cc: axboe, brauner, mst, linux-kernel, oleg, stefanha, linux,
nicolas.dichtel, virtualization, torvalds
Long story short.
In the patch below the first hunk is a noop.
The code you are bypassing was added to ensure that process termination
(aka SIGKILL) is processed before any other signals. Other than signal
processing order there are not any substantive differences in the two
code paths. With all signals except SIGSTOP == 19 and SIGKILL == 9
blocked SIGKILL should always be processed before SIGSTOP.
Can you try patch with just the last hunk that does
s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough?
I have no objections to the final hunk.
Mike Christie <michael.christie@oracle.com> writes:
> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.
>
> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.
>
> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
>
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
>
> Note that the chunk I'm now bypassing does:
>
> sigdelset(&current->pending.signal, SIGKILL);
>
> We look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
> kernel/signal.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..ae4972eea5db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
> struct k_sigaction *ka;
> enum pid_type type;
>
> - /* Has this task already been marked for death? */
> - if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> - signal->group_exec_task) {
> + /*
> + * Has this task already been marked for death?
> + *
> + * If this is a PF_USER_WORKER then the task may need to do
> + * extra work that requires waiting on running work, so we want
> + * to dequeue the signal below and tell the caller it's time to
> + * start its exit procedure. When the work has completed then
> + * the task will exit.
> + */
> + if (!(current->flags & PF_USER_WORKER) &&
> + ((signal->flags & SIGNAL_GROUP_EXIT) ||
> + signal->group_exec_task)) {
> clear_siginfo(&ksig->info);
> ksig->info.si_signo = signr = SIGKILL;
> sigdelset(&current->pending.signal, SIGKILL);
This hunk is a confusing no-op.
> @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
> }
>
> /*
> - * PF_IO_WORKER threads will catch and exit on fatal signals
> + * PF_USER_WORKER threads will catch and exit on fatal signals
> * themselves. They have cleanup that must be performed, so
> * we cannot call do_exit() on their behalf.
> */
> - if (current->flags & PF_IO_WORKER)
> + if (current->flags & PF_USER_WORKER)
> goto out;
>
> /*
This hunk is good and makes sense.
Eric
* Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
[not found] ` <20230518-lokomotive-aufziehen-dbc432136b76@brauner>
@ 2023-05-18 15:03 ` Mike Christie
2023-05-18 18:38 ` Eric W. Biederman
0 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 15:03 UTC
To: Christian Brauner
Cc: axboe, mst, linux-kernel, oleg, linux, ebiederm, stefanha,
nicolas.dichtel, virtualization, torvalds
On 5/18/23 9:18 AM, Christian Brauner wrote:
>> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
>> if (!node) {
>> schedule();
>> /*
>> - * When we get a SIGKILL our release function will
>> - * be called. That will stop new IOs from being queued
>> - * and check for outstanding cmd responses. It will then
>> - * call vhost_task_stop to exit us.
>> + * When we get a SIGKILL we kick off a work to
>> + * run the driver's helper to stop new work and
>> + * handle completions. When they are done they will
>> + * call vhost_task_stop to tell us to exit.
>> */
>> - vhost_task_get_signal();
>> + if (vhost_task_get_signal())
>> + schedule_work(&dev->destroy_worker);
>> }
>
> I'm pretty sure you still need to actually call exit here. Basically
> mirror what's done in io_worker_exit() minus the io specific bits.
We do call do_exit(). Once destroy_worker has flushed the device and
all outstanding IO has completed, it calls vhost_task_stop(). vhost_worker()
above then breaks out of the loop and returns, and vhost_task_fn() does
do_exit().
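For reference, vhost_task_fn() is roughly (paraphrased from
kernel/vhost_task.c, so minor details may differ):

static int vhost_task_fn(void *data)
{
	struct vhost_task *vtsk = data;
	int ret;

	/* Runs vhost_worker() until vhost_task_stop() tells it to stop. */
	ret = vtsk->fn(vtsk->data);
	/* Let the vhost_task_stop() caller know we are done. */
	complete(&vtsk->exited);
	do_exit(ret);
}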
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 3:49 ` Eric W. Biederman
@ 2023-05-18 15:21 ` Mike Christie
2023-05-18 16:25 ` Oleg Nesterov
0 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 15:21 UTC
To: Eric W. Biederman
Cc: axboe, brauner, mst, linux-kernel, oleg, stefanha, linux,
nicolas.dichtel, virtualization, torvalds
On 5/17/23 10:49 PM, Eric W. Biederman wrote:
>
> Long story short.
>
> In the patch below the first hunk is a noop.
>
> The code you are bypassing was added to ensure that process termination
> (aka SIGKILL) is processed before any other signals. Other than signal
> processing order there are not any substantive differences in the two
> code paths. With all signals except SIGSTOP == 19 and SIGKILL == 9
> blocked SIGKILL should always be processed before SIGSTOP.
>
> Can you try patch with just the last hunk that does
> s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough?
>
If I just have the last hunk, then when we get SIGKILL, in code
like:
vhost_worker()
schedule()
if (has IO)
handle_IO()
the schedule() calls will hit the signal_pending_state() check for
signal_pending or __fatal_signal_pending, and so instead of waiting
for whatever wake_up call we normally wait for, we tend to just
return immediately. If you just run Qemu (the parent of the vhost_task)
and send SIGKILL then sometimes the vhost_task just spins and it
would look like the task has taken over the CPU (this is what I hit
when I tested Linus's patch).
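For reference, the check that schedule() ends up doing is roughly
(quoting signal_pending_state() from memory, so details may differ
slightly):

static inline int signal_pending_state(unsigned int state,
				       struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}

So with the SIGKILL left pending and the task in TASK_INTERRUPTIBLE,
every schedule() returns right away.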
With the first hunk of the patch, we will end up dequeuing the SIGKILL
and clearing TIF_SIGPENDING, so the vhost_task can still do some work
before it exits.
In the other patches we do:
if (get_signal(ksig))
start_exit_cleanup_by_stopping_newIO()
flush running IO()
exit()
But to do the flush running IO() part of this I need to wait for it so
that's why I wanted to be able to dequeue the SIGKILL and clear the
TIF_SIGPENDING bit.
Or I don't need this specifically. In patch 0/8 I said I knew you guys
would not like it :) If I just have a:
if (fatal_signal())
clear_fatal_signal()
then it would work for me.
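Something like this minimal sketch is all I need here.
clear_fatal_signal() is a made-up name, not an existing helper; it
just drops the pending per-task SIGKILL and recomputes TIF_SIGPENDING
under the siglock:

static void clear_fatal_signal(void)
{
	spin_lock_irq(&current->sighand->siglock);
	/* Drop the pending per-task SIGKILL ... */
	sigdelset(&current->pending.signal, SIGKILL);
	/* ... and clear TIF_SIGPENDING if nothing else is pending. */
	recalc_sigpending();
	spin_unlock_irq(&current->sighand->siglock);
}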
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
[not found] ` <20230518-kontakt-geduckt-25bab595f503@brauner>
@ 2023-05-18 15:27 ` Mike Christie
[not found] ` <20230518-ratgeber-erbeben-843e68b0d6ac@brauner>
0 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 15:27 UTC
To: Christian Brauner
Cc: axboe, mst, linux-kernel, oleg, linux, ebiederm, stefanha,
nicolas.dichtel, virtualization, torvalds
On 5/18/23 3:08 AM, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:13PM -0500, Mike Christie wrote:
>> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
>> set when we are dealing with PF_USER_WORKER tasks.
>>
>> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
>> We can easily stop new work/IO from being queued to the vhost_task, but
>> for IO that's already been sent to something like the block layer we
>> need to wait for the response then process it. These types of IO
>> completions use the vhost_task to process the completion so we can't
>> exit immediately.
>>
>> We need to wait for and then handle those completions from the
>> vhost_task, but when we have a SIGKILL pending, functions like
>> schedule() return immediately so we can't wait like normal. Functions
>> like vhost_worker() degrade to just a while(1); loop.
>>
>> This patch has get_signal drop down to the normal code path when
>> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
>> there is a SIGKILL but still perform some blocking cleanup.
>>
>> Note that the chunk I'm now bypassing does:
>>
>> sigdelset(&current->pending.signal, SIGKILL);
>>
>> We look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
>> group_exec_task we are already doing that on the threads in the
>> group.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> ---
>
> I think you just got confused by the original discussion that was split
> into two separate threads:
>
> (1) The discussion based on your original proposal to adjust the signal
> handling logic to accommodate vhost workers as they are right now.
> That's where Oleg jumped in.
> (2) My request - which you did in this series - of rewriting vhost
> workers to behave more like io_uring workers.
>
> Both problems are orthogonal. The gist of my proposal is to avoid (1) by
> doing (2). So the only change that's needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ which is pretty obvious as io_uring
> workers and vhost workers now almost fully collapse into the same
> concept.
>
> So forget (1). If additional signal patches are needed as discussed in
> (1) then it must be because of a bug that would affect io_uring workers
> today.
I maybe didn't exactly misunderstand you. I did patch 1/8 to show issues I
hit when doing 2-8. See my reply to Eric's question about what I'm
hitting and why using only the last part of the patch did not work for me:
https://lore.kernel.org/lkml/20230518000920.191583-2-michael.christie@oracle.com/T/#mc6286d1a42c79761248ba55f1dd7a433379be6d1
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 15:21 ` Mike Christie
@ 2023-05-18 16:25 ` Oleg Nesterov
2023-05-18 16:42 ` Mike Christie
0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2023-05-18 16:25 UTC
To: Mike Christie
Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman,
stefanha, nicolas.dichtel, virtualization, torvalds
I too do not understand the 1st change in this patch ...
On 05/18, Mike Christie wrote:
>
> In the other patches we do:
>
> if (get_signal(ksig))
> start_exit_cleanup_by_stopping_newIO()
> flush running IO()
> exit()
>
> But to do the flush running IO() part of this I need to wait for it so
> that's why I wanted to be able to dequeue the SIGKILL and clear the
> TIF_SIGPENDING bit.
But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
if ((signal->flags & SIGNAL_GROUP_EXIT) ||
signal->group_exec_task) {
clear_siginfo(&ksig->info);
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
this "dequeues" SIGKILL,
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();
this clears TIF_SIGPENDING.
> Or I don't need this specifically. In patch 0/8 I said I knew you guys
> would not like it :) If I just have a:
>
> if (fatal_signal())
> clear_fatal_signal()
see above...
Well... I think this code is actually wrong if SIGSTOP is pending and
the task is PF_IO_WORKER, but this is also true for io-threads so we can
discuss this separately.
Oleg.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 16:25 ` Oleg Nesterov
@ 2023-05-18 16:42 ` Mike Christie
2023-05-18 17:04 ` Oleg Nesterov
0 siblings, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 16:42 UTC (permalink / raw)
To: Oleg Nesterov
Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman,
stefanha, nicolas.dichtel, virtualization, torvalds
On 5/18/23 11:25 AM, Oleg Nesterov wrote:
> I too do not understand the 1st change in this patch ...
>
> On 05/18, Mike Christie wrote:
>>
>> In the other patches we do:
>>
>> if (get_signal(ksig))
>> start_exit_cleanup_by_stopping_newIO()
>> flush running IO()
>> exit()
>>
>> But to do the flush running IO() part of this I need to wait for it so
>> that's why I wanted to be able to dequeue the SIGKILL and clear the
>> TIF_SIGPENDING bit.
>
> But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
>
> if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> signal->group_exec_task) {
> clear_siginfo(&ksig->info);
> ksig->info.si_signo = signr = SIGKILL;
> sigdelset(&current->pending.signal, SIGKILL);
>
> this "dequeues" SIGKILL,
>
> trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
> &sighand->action[SIGKILL - 1]);
> recalc_sigpending();
>
> this clears TIF_SIGPENDING.
>
I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
I'll dig into why.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 16:42 ` Mike Christie
@ 2023-05-18 17:04 ` Oleg Nesterov
2023-05-18 18:28 ` Eric W. Biederman
0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2023-05-18 17:04 UTC (permalink / raw)
To: Mike Christie
Cc: axboe, brauner, mst, linux-kernel, linux, Eric W. Biederman,
stefanha, nicolas.dichtel, virtualization, torvalds
On 05/18, Mike Christie wrote:
>
> On 5/18/23 11:25 AM, Oleg Nesterov wrote:
> > I too do not understand the 1st change in this patch ...
> >
> > On 05/18, Mike Christie wrote:
> >>
> >> In the other patches we do:
> >>
> >> if (get_signal(ksig))
> >> start_exit_cleanup_by_stopping_newIO()
> >> flush running IO()
> >> exit()
> >>
> >> But to do the flush running IO() part of this I need to wait for it so
> >> that's why I wanted to be able to dequeue the SIGKILL and clear the
> >> TIF_SIGPENDING bit.
> >
> > But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
> >
> > if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> > signal->group_exec_task) {
> > clear_siginfo(&ksig->info);
> > ksig->info.si_signo = signr = SIGKILL;
> > sigdelset(&current->pending.signal, SIGKILL);
> >
> > this "dequeues" SIGKILL,
OOPS. This doesn't remove SIGKILL from current->signal->shared_pending.
> >
> > trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
> > &sighand->action[SIGKILL - 1]);
> > recalc_sigpending();
> >
> > this clears TIF_SIGPENDING.
No, I was wrong, recalc_sigpending() won't clear TIF_SIGPENDING if
SIGKILL is in signal->shared_pending.
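A simplified sketch of the check in question (trimmed from kernel/signal.c;
the jobctl and cgroup-freezer conditions are omitted) shows why: a non-blocked
signal sitting in either the per-task queue or the process-shared queue keeps
TIF_SIGPENDING set:

	static int recalc_sigpending_tsk(struct task_struct *t)
	{
		if (PENDING(&t->pending, &t->blocked) ||
		    PENDING(&t->signal->shared_pending, &t->blocked)) {
			set_tsk_thread_flag(t, TIF_SIGPENDING);
			return 1;
		}
		/* Only now may the caller clear TIF_SIGPENDING. */
		return 0;
	}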
> I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
> I'll dig into why.
See above, sorry for confusion.
And again, there is another problem with SIGSTOP. To simplify, suppose
a PF_IO_WORKER thread does something like
while (signal_pending(current))
get_signal(...);
this will loop forever if (SIGNAL_GROUP_EXIT || group_exec_task) and
SIGSTOP is pending.
Oleg.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
[not found] ` <20230518-ratgeber-erbeben-843e68b0d6ac@brauner>
@ 2023-05-18 18:08 ` Oleg Nesterov
[not found] ` <20230518-fettgehalt-erdbeben-25587a432815@brauner>
0 siblings, 1 reply; 28+ messages in thread
From: Oleg Nesterov @ 2023-05-18 18:08 UTC (permalink / raw)
To: Christian Brauner
Cc: axboe, mst, linux, linux-kernel, ebiederm, stefanha,
nicolas.dichtel, virtualization, torvalds
On 05/18, Christian Brauner wrote:
>
> Yeah, but these are issues that exist with PF_IO_WORKER then too
This was my thought too but I am starting to think I was wrong.
Of course I don't understand the code in io_uring/ but it seems
that it always breaks the IO loops if get_signal() returns SIGKILL.
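The pattern in question looks roughly like this in io_uring's worker loop
(paraphrased from io_uring/io-wq.c): a fatal signal ends the loop, so
get_signal() is never re-entered after it reports one:

	if (signal_pending(current)) {
		struct ksignal ksig;

		if (!get_signal(&ksig))
			continue;	/* non-fatal: keep processing work */
		break;			/* fatal (e.g. SIGKILL): leave the IO loop */
	}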
Oleg.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
[not found] ` <20230518-fettgehalt-erdbeben-25587a432815@brauner>
@ 2023-05-18 18:23 ` Oleg Nesterov
0 siblings, 0 replies; 28+ messages in thread
From: Oleg Nesterov @ 2023-05-18 18:23 UTC (permalink / raw)
To: Christian Brauner
Cc: axboe, mst, linux, linux-kernel, ebiederm, stefanha,
nicolas.dichtel, virtualization, torvalds
On 05/18, Christian Brauner wrote:
>
> On Thu, May 18, 2023 at 08:08:10PM +0200, Oleg Nesterov wrote:
> > On 05/18, Christian Brauner wrote:
> > >
> > > Yeah, but these are issues that exist with PF_IO_WORKER then too
> >
> > This was my thought too but I am starting to think I was wrong.
> >
> > Of course I don't understand the code in io_uring/ but it seems
> > that it always breaks the IO loops if get_signal() returns SIGKILL.
>
> Yeah, it does, and I think Mike has a point that vhost could be running
> into an issue here that io_uring currently avoids. But I don't think
> we should rely on that.
So what do you propose?
Unless (quite possibly) I am confused again, unlike io_uring vhost can't
tolerate signal_pending() == T in the cleanup-after-SIGKILL paths?
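To illustrate the point (hypothetical code, not from vhost; flush_waitq and
inflight_ios are made-up names): any interruptible wait in the flush path
fails fast once signal_pending() is true, so the cleanup cannot block normally:

	ret = wait_event_interruptible(flush_waitq, inflight_ios == 0);
	if (ret == -ERESTARTSYS) {
		/* signal pending: we cannot sleep, and retrying just spins */
	}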
Oleg.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 17:04 ` Oleg Nesterov
@ 2023-05-18 18:28 ` Eric W. Biederman
2023-05-18 22:57 ` Mike Christie
2023-05-22 13:30 ` Oleg Nesterov
0 siblings, 2 replies; 28+ messages in thread
From: Eric W. Biederman @ 2023-05-18 18:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: axboe, brauner, mst, linux, linux-kernel, stefanha,
nicolas.dichtel, virtualization, torvalds
Oleg Nesterov <oleg@redhat.com> writes:
> On 05/18, Mike Christie wrote:
>>
>> On 5/18/23 11:25 AM, Oleg Nesterov wrote:
>> > I too do not understand the 1st change in this patch ...
>> >
>> > On 05/18, Mike Christie wrote:
>> >>
>> >> In the other patches we do:
>> >>
>> >> if (get_signal(ksig))
>> >> start_exit_cleanup_by_stopping_newIO()
>> >> flush running IO()
>> >> exit()
>> >>
>> >> But to do the flush running IO() part of this I need to wait for it so
>> >> that's why I wanted to be able to dequeue the SIGKILL and clear the
>> >> TIF_SIGPENDING bit.
>> >
>> > But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
>> >
>> > if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>> > signal->group_exec_task) {
>> > clear_siginfo(&ksig->info);
>> > ksig->info.si_signo = signr = SIGKILL;
>> > sigdelset(&current->pending.signal, SIGKILL);
>> >
>> > this "dequeues" SIGKILL,
>
> OOPS. this doesn't remove SIGKILL from current->signal->shared_pending
Neither does calling get_signal the first time.
But the second time get_signal is called it should work.
Leaving SIGKILL in current->signal->shared_pending when it has already
been short circuit delivered appears to be an out and out bug.
>> >
>> > trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>> > &sighand->action[SIGKILL - 1]);
>> > recalc_sigpending();
>> >
>> > this clears TIF_SIGPENDING.
>
> No, I was wrong, recalc_sigpending() won't clear TIF_SIGPENDING if
> SIGKILL is in signal->shared_pending
That feels wrong as well.
>> I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
>> I'll dig into why.
>
> See above, sorry for confusion.
>
>
>
> And again, there is another problem with SIGSTOP. To simplify, suppose
> a PF_IO_WORKER thread does something like
>
> while (signal_pending(current))
> get_signal(...);
>
> this will loop forever if (SIGNAL_GROUP_EXIT || group_exec_task) and
> SIGSTOP is pending.
I think we want to do something like the untested diff below.
That the PF_IO_WORKER test allows get_signal to be called
after get_signal returns a fatal signal (aka SIGKILL) seems wrong.
That doesn't happen in the io_uring case, and it happens nowhere
else.
The change to complete_signal appears obviously correct, although
a pending siginfo still needs to be handled.
The change to recalc_sigpending also appears mostly right, but I am not
certain that the !freezing test is in the proper place. Nor am I
certain it won't have other surprising effects.
Still the big issue seems to be the way get_signal is connected into
these threads so that it keeps getting called. Calling get_signal after
a fatal signal has been returned happens nowhere else, and even if we fix
it today it is likely to lead to bugs in the future, because whoever is
testing and updating the code is unlikely to have a vhost test case
they care about.
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..4d54718cad36 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
void recalc_sigpending(void)
{
- if (!recalc_sigpending_tsk(current) && !freezing(current))
+ if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+ ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+ !__fatal_signal_pending(current)))
clear_thread_flag(TIF_SIGPENDING);
}
@@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* This signal will be fatal to the whole group.
*/
if (!sig_kernel_coredump(sig)) {
+ /*
+ * The signal is being short circuit delivered;
+ * don't leave it pending.
+ */
+ if (type != PIDTYPE_PID) {
+ sigdelset(&t->signal->shared_pending.signal, sig);
+
/*
* Start a group exit and wake everybody up.
* This way we don't have other threads
Eric
* Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
2023-05-18 15:03 ` Mike Christie
@ 2023-05-18 18:38 ` Eric W. Biederman
0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2023-05-18 18:38 UTC (permalink / raw)
To: Mike Christie
Cc: axboe, Christian Brauner, mst, linux-kernel, oleg, linux,
stefanha, nicolas.dichtel, virtualization, torvalds
Mike Christie <michael.christie@oracle.com> writes:
> On 5/18/23 9:18 AM, Christian Brauner wrote:
>>> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
>>> if (!node) {
>>> schedule();
>>> /*
>>> - * When we get a SIGKILL our release function will
>>> - * be called. That will stop new IOs from being queued
>>> - * and check for outstanding cmd responses. It will then
>>> - * call vhost_task_stop to exit us.
>>> + * When we get a SIGKILL we kick off a work to
>>> + * run the driver's helper to stop new work and
>>> + * handle completions. When they are done they will
>>> + * call vhost_task_stop to tell us to exit.
>>> */
>>> - vhost_task_get_signal();
>>> + if (vhost_task_get_signal())
>>> + schedule_work(&dev->destroy_worker);
>>> }
>>
>> I'm pretty sure you still need to actually call exit here. Basically
>> mirror what's done in io_worker_exit() minus the io specific bits.
>
> We do call do_exit(). Once destroy_worker has flushed the device and
> all outstanding IO has completed, it calls vhost_task_stop(). vhost_worker()
> above then breaks out of the loop and returns and vhost_task_fn() does
> do_exit().
I am not certain how you want to structure this, but you really should
not call get_signal again after it returns positive, before you call do_exit.
You are in completely uncharted and untested waters calling get_signal
multiple times, when get_signal figures the proper response is to
call do_exit itself.
Eric
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 18:28 ` Eric W. Biederman
@ 2023-05-18 22:57 ` Mike Christie
2023-05-19 4:16 ` Eric W. Biederman
2023-05-22 13:30 ` Oleg Nesterov
1 sibling, 1 reply; 28+ messages in thread
From: Mike Christie @ 2023-05-18 22:57 UTC (permalink / raw)
To: Eric W. Biederman, Oleg Nesterov
Cc: axboe, brauner, mst, linux-kernel, linux, virtualization,
stefanha, nicolas.dichtel, torvalds
On 5/18/23 1:28 PM, Eric W. Biederman wrote:
> Still the big issue seems to be the way get_signal is connected into
> these threads so that it keeps getting called. Calling get_signal after
> a fatal signal has been returned happens nowhere else, and even if we fix
> it today it is likely to lead to bugs in the future, because whoever is
> testing and updating the code is unlikely to have a vhost test case
> they care about.
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..4d54718cad36 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>
> void recalc_sigpending(void)
> {
> - if (!recalc_sigpending_tsk(current) && !freezing(current))
> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> + ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> + !__fatal_signal_pending(current)))
> clear_thread_flag(TIF_SIGPENDING);
>
> }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> * This signal will be fatal to the whole group.
> */
> if (!sig_kernel_coredump(sig)) {
> + /*
>> + * The signal is being short circuit delivered;
>> + * don't leave it pending.
>> + */
>> + if (type != PIDTYPE_PID) {
>> + sigdelset(&t->signal->shared_pending.signal, sig);
> +
> /*
> * Start a group exit and wake everybody up.
> * This way we don't have other threads
>
If I change up your patch so the last part is moved down a bit to when we set t
like this:
diff --git a/kernel/signal.c b/kernel/signal.c
index 0ac48c96ab04..c976a80650db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
void recalc_sigpending(void)
{
- if (!recalc_sigpending_tsk(current) && !freezing(current))
+ if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+ ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+ !__fatal_signal_pending(current)))
clear_thread_flag(TIF_SIGPENDING);
-
}
EXPORT_SYMBOL(recalc_sigpending);
@@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
signal->group_exit_code = sig;
signal->group_stop_count = 0;
t = p;
+ /*
+ * The signal is being short circuit delivered;
+ * don't leave it pending.
+ */
+ if (type != PIDTYPE_PID) {
+ struct sigpending *pending;
+
+ pending = &t->signal->shared_pending;
+ sigdelset(&pending->signal, sig);
+ }
+
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
sigaddset(&t->pending.signal, SIGKILL);
Then get_signal() works the way Oleg mentioned it should earlier.
For vhost I just need the code below which is just Linus's patch plus a call
to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.
Note that when we get SIGKILL, the vhost file_operations->release function is called via
do_exit -> exit_files -> put_files_struct -> close_files
and so the vhost release function starts to flush IO and stop the worker/vhost
task. In vhost_worker() we then just handle the last completions for already
running IO. When the vhost release function detects they are done, it calls
vhost_task_stop(); vhost_worker() returns, and then vhost_task_fn() does do_exit().
So we don't return immediately when get_signal() returns non-zero.
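For reference, the teardown ordering just described, as a sketch (function
names as in the patches above):

	/*
	 * On SIGKILL:
	 *   do_exit()
	 *     -> exit_files() -> put_files_struct() -> close_files()
	 *       -> vhost file_operations->release()
	 *            stops new IO, waits for in-flight IO,
	 *            then calls vhost_task_stop()
	 *
	 * vhost_worker() keeps handling completions until
	 * vhost_task_should_stop() returns true, then returns,
	 * and vhost_task_fn() finally calls do_exit().
	 */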
So it works, but it sounds like you don't like vhost relying on the behavior,
and it's non-standard to use get_signal() like we are. So I'm not sure how we
want to proceed.
Maybe the safest is to revert:
commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
Author: Mike Christie <michael.christie@oracle.com>
Date: Fri Mar 10 16:03:32 2023 -0600
vhost: use vhost_tasks for worker threads
and retry this for the next kernel when we can do proper testing and more
code review?
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
}
node = llist_del_all(&worker->work_list);
- if (!node)
+ if (!node) {
schedule();
+ /*
+ * When we get a SIGKILL our release function will
+ * be called. That will stop new IOs from being queued
+ * and check for outstanding cmd responses. It will then
+ * call vhost_task_stop to exit us.
+ */
+ vhost_task_get_signal();
+ }
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
+ u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..79bf0ed4ded0 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
void vhost_task_start(struct vhost_task *vtsk);
void vhost_task_stop(struct vhost_task *vtsk);
bool vhost_task_should_stop(struct vhost_task *vtsk);
+void vhost_task_get_signal(void);
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
- if (args->io_thread) {
- /*
- * Mark us an IO worker, and block any signal that isn't
- * fatal or STOP
- */
+ if (args->io_thread)
p->flags |= PF_IO_WORKER;
+ if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
- }
if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
- if (args->ignore_signals)
- ignore_signals(p);
-
stackleak_task_init(p);
if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn_arg = arg,
.io_thread = 1,
.user_worker = 1,
+ .block_signals = 1,
};
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..0ac48c96ab04 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2861,11 +2861,11 @@ bool get_signal(struct ksignal *ksig)
}
/*
- * PF_IO_WORKER threads will catch and exit on fatal signals
+ * PF_USER_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
* we cannot call do_exit() on their behalf.
*/
- if (current->flags & PF_IO_WORKER)
+ if (current->flags & PF_USER_WORKER)
goto out;
/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..82467f450f0d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
*/
void vhost_task_stop(struct vhost_task *vtsk)
{
- pid_t pid = vtsk->task->pid;
-
set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
wake_up_process(vtsk->task);
/*
* Make sure vhost_task_fn is no longer accessing the vhost_task before
- * freeing it below. If userspace crashed or exited without closing,
- * then the vhost_task->task could already be marked dead so
- * kernel_wait will return early.
+ * freeing it below.
*/
wait_for_completion(&vtsk->exited);
- /*
- * If we are just closing/removing a device and the parent process is
- * not exiting then reap the task.
- */
- kernel_wait4(pid, NULL, __WCLONE, NULL);
kfree(vtsk);
}
EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -61,6 +52,25 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
}
EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * This checks if there are signals and will handle freeze requests. For
+ * SIGKILL, our file_operations->release is already being called when we
+ * see the signal, so we let release call vhost_task_stop to tell the
+ * vhost_task to exit when it's done using the task.
+ */
+void vhost_task_get_signal(void)
+{
+ struct ksignal ksig;
+
+ if (!signal_pending(current))
+ return;
+
+ get_signal(&ksig);
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
/**
* vhost_task_create - create a copy of a process to be used by the kernel
* @fn: thread stack
@@ -75,13 +85,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
+ .block_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 22:57 ` Mike Christie
@ 2023-05-19 4:16 ` Eric W. Biederman
2023-05-19 23:24 ` Mike Christie
0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2023-05-19 4:16 UTC (permalink / raw)
To: Mike Christie
Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux,
nicolas.dichtel, virtualization, torvalds
Mike Christie <michael.christie@oracle.com> writes:
> On 5/18/23 1:28 PM, Eric W. Biederman wrote:
>> Still the big issue seems to be the way get_signal is connected into
>> these threads so that it keeps getting called. Calling get_signal after
>> a fatal signal has been returned happens nowhere else, and even if we fix
>> it today it is likely to lead to bugs in the future, because whoever is
>> testing and updating the code is unlikely to have a vhost test case
>> they care about.
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8f6330f0e9ca..4d54718cad36 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>
>> void recalc_sigpending(void)
>> {
>> - if (!recalc_sigpending_tsk(current) && !freezing(current))
>> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>> + ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>> + !__fatal_signal_pending(current)))
>> clear_thread_flag(TIF_SIGPENDING);
>>
>> }
>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>> * This signal will be fatal to the whole group.
>> */
>> if (!sig_kernel_coredump(sig)) {
>> + /*
>> + * The signal is being short circuit delivered;
>> + * don't leave it pending.
>> + */
>> + if (type != PIDTYPE_PID) {
>> + sigdelset(&t->signal->shared_pending.signal, sig);
>> +
>> /*
>> * Start a group exit and wake everybody up.
>> * This way we don't have other threads
>>
>
> If I change up your patch so the last part is moved down a bit to when we set t
> like this:
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0ac48c96ab04..c976a80650db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>
> void recalc_sigpending(void)
> {
> - if (!recalc_sigpending_tsk(current) && !freezing(current))
> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> + ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> + !__fatal_signal_pending(current)))
> clear_thread_flag(TIF_SIGPENDING);
> -
Can we get rid of this suggested change to recalc_sigpending? The more I look
at it the more I am convinced it is not safe. In particular I believe
it is incompatible with dump_interrupted() in fs/coredump.c.
The code in fs/coredump.c is the closest code we have to what you are
trying to do with vhost_worker after the session is killed. It also
struggles with TIF_SIGPENDING getting set.
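For reference, a sketch of that check (simplified; the exact predicate has
varied across kernel versions):

	static bool dump_interrupted(void)
	{
		/*
		 * A pending fatal signal, or freezing, aborts the dump; the
		 * dump write path also bails on pending signals, which is
		 * the struggle with TIF_SIGPENDING mentioned above.
		 */
		return fatal_signal_pending(current) || freezing(current);
	}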
> }
> EXPORT_SYMBOL(recalc_sigpending);
>
> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> signal->group_exit_code = sig;
> signal->group_stop_count = 0;
> t = p;
> + /*
> + * The signal is being short circuit delivered;
> + * don't leave it pending.
> + */
> + if (type != PIDTYPE_PID) {
> + struct sigpending *pending;
> +
> + pending = &t->signal->shared_pending;
> + sigdelset(&pending->signal, sig);
> + }
> +
> do {
> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> sigaddset(&t->pending.signal, SIGKILL);
>
>
> Then get_signal() works like how Oleg mentioned it should earlier.
I am puzzled that it makes a difference, as t->signal and p->signal should
point to the same thing, and in fact the code would read more clearly as
sigdelset(&signal->shared_pending.signal, sig);
But all of that seems minor.
> For vhost I just need the code below which is just Linus's patch plus a call
> to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.
>
> Note that when we get SIGKILL, the vhost file_operations->release function is called via
>
> do_exit -> exit_files -> put_files_struct -> close_files
>
> and so the vhost release function starts to flush IO and stop the worker/vhost
> task. In vhost_worker() then we just handle those last completions for already
> running IO. When the vhost release function detects they are done it does
> vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does do_exit().
> So we don't return immediately when get_signal() returns non-zero.
>
> So it works, but it sounds like you don't like vhost relying on the behavior,
> and it's non standard to use get_signal() like we are. So I'm not sure how we
> want to proceed.
Let me clarify my concern.
Your code modifies get_signal as:
/*
- * PF_IO_WORKER threads will catch and exit on fatal signals
+ * PF_USER_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
* we cannot call do_exit() on their behalf.
*/
- if (current->flags & PF_IO_WORKER)
+ if (current->flags & PF_USER_WORKER)
goto out;
/*
* Death signals, no core dump.
*/
do_group_exit(ksig->info.si_signo);
/* NOTREACHED */
Which means by modifying get_signal you are logically deleting the
do_group_exit from get_signal. As far as that goes that is a perfectly
reasonable change. The problem is you wind up calling get_signal again
after that. That does not make sense.
I would suggest doing something like:
static int vhost_worker(void *data)
{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;
bool dead = false;
for (;;) {
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);
if (vhost_task_should_stop(worker->vtsk)) {
__set_current_state(TASK_RUNNING);
break;
}
+ if (!dead && signal_pending(current)) {
+ struct ksignal ksig;
+
+ dead = get_signal(&ksig);
+ if (dead) {
+ /*
+ * When the process exits we kick off a work to
+ * run the driver's helper to stop new work and
+ * handle completions. When they are done they will
+ * call vhost_task_stop to tell us to exit.
+ */
+ schedule_work(&dev->destroy_worker);
+ clear_thread_flag(TIF_SIGPENDING);
+ }
+ }
+
node = llist_del_all(&worker->work_list);
if (!node)
schedule();
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
__set_current_state(TASK_RUNNING);
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
cond_resched();
}
}
return 0;
}
The idea is twofold.
1) Call get_signal every time through the loop to handle SIGSTOP (to the
process).
2) Don't call get_signal after you know the process is exiting.
With a single call to get_signal (once the process is dead) I don't
see any fundamental problems with your approach. It is doing pretty
much what fs/coredump.c is trying to do.
*Grumble* fs/coredump.c also struggles with TIF_SIGPENDING. But at
least you won't be alone.
> Maybe the safest is to revert:
>
> commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
> Author: Mike Christie <michael.christie@oracle.com>
> Date: Fri Mar 10 16:03:32 2023 -0600
>
> vhost: use vhost_tasks for worker threads
>
> and retry this for the next kernel when we can do proper testing and more
> code review?
I can see wisdom in that. It is always nice when you don't have
to scramble to get the code to do what you want.
What is the diff below? It does not appear to be a revert diff.
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a92af08e7864..1ba9e068b2ab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -349,8 +349,16 @@ static int vhost_worker(void *data)
> }
>
> node = llist_del_all(&worker->work_list);
> - if (!node)
> + if (!node) {
> schedule();
> + /*
> + * When we get a SIGKILL our release function will
> + * be called. That will stop new IOs from being queued
> + * and check for outstanding cmd responses. It will then
> + * call vhost_task_stop to exit us.
> + */
> + vhost_task_get_signal();
> + }
>
> node = llist_reverse_order(node);
> /* make sure flag is seen after deletion */
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 537cbf9a2ade..249a5ece9def 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -29,7 +29,7 @@ struct kernel_clone_args {
> u32 io_thread:1;
> u32 user_worker:1;
> u32 no_files:1;
> - u32 ignore_signals:1;
> + u32 block_signals:1;
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
> index 6123c10b99cf..79bf0ed4ded0 100644
> --- a/include/linux/sched/vhost_task.h
> +++ b/include/linux/sched/vhost_task.h
> @@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
> void vhost_task_start(struct vhost_task *vtsk);
> void vhost_task_stop(struct vhost_task *vtsk);
> bool vhost_task_should_stop(struct vhost_task *vtsk);
> +void vhost_task_get_signal(void);
>
> #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..9e04ab5c3946 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
> p->flags |= PF_KTHREAD;
> if (args->user_worker)
> p->flags |= PF_USER_WORKER;
> - if (args->io_thread) {
> - /*
> - * Mark us an IO worker, and block any signal that isn't
> - * fatal or STOP
> - */
> + if (args->io_thread)
> p->flags |= PF_IO_WORKER;
> + if (args->block_signals)
> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> - }
>
> if (args->name)
> strscpy_pad(p->comm, args->name, sizeof(p->comm));
> @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
> if (retval)
> goto bad_fork_cleanup_io;
>
> - if (args->ignore_signals)
> - ignore_signals(p);
> -
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
> @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
> .fn_arg = arg,
> .io_thread = 1,
> .user_worker = 1,
> + .block_signals = 1,
> };
>
> return copy_process(NULL, 0, node, &args);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..0ac48c96ab04 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2861,11 +2861,11 @@ bool get_signal(struct ksignal *ksig)
> }
>
> /*
> - * PF_IO_WORKER threads will catch and exit on fatal signals
> + * PF_USER_WORKER threads will catch and exit on fatal signals
> * themselves. They have cleanup that must be performed, so
> * we cannot call do_exit() on their behalf.
> */
> - if (current->flags & PF_IO_WORKER)
> + if (current->flags & PF_USER_WORKER)
> goto out;
>
> /*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..82467f450f0d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
> */
> void vhost_task_stop(struct vhost_task *vtsk)
> {
> - pid_t pid = vtsk->task->pid;
> -
> set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
> wake_up_process(vtsk->task);
> /*
> * Make sure vhost_task_fn is no longer accessing the vhost_task before
> - * freeing it below. If userspace crashed or exited without closing,
> - * then the vhost_task->task could already be marked dead so
> - * kernel_wait will return early.
> + * freeing it below.
> */
> wait_for_completion(&vtsk->exited);
> - /*
> - * If we are just closing/removing a device and the parent process is
> - * not exiting then reap the task.
> - */
> - kernel_wait4(pid, NULL, __WCLONE, NULL);
> kfree(vtsk);
> }
> EXPORT_SYMBOL_GPL(vhost_task_stop);
> @@ -61,6 +52,25 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
> }
> EXPORT_SYMBOL_GPL(vhost_task_should_stop);
>
> +/**
> + * vhost_task_get_signal - Check if there are pending signals
> + *
> + * This checks if there are signals and will handle freeze requests. For
> + * SIGKILL, our file_operations->release is already being called when we
> + * see the signal, so we let release call vhost_task_stop to tell the
> + * vhost_task to exit when it's done using the task.
> + */
> +void vhost_task_get_signal(void)
> +{
> + struct ksignal ksig;
> +
> + if (!signal_pending(current))
> + return;
> +
> + get_signal(&ksig);
> +}
> +EXPORT_SYMBOL_GPL(vhost_task_get_signal);
> +
> /**
> * vhost_task_create - create a copy of a process to be used by the kernel
> * @fn: thread stack
> @@ -75,13 +85,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
> const char *name)
> {
> struct kernel_clone_args args = {
> - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> + CLONE_THREAD | CLONE_SIGHAND,
> .exit_signal = 0,
> .fn = vhost_task_fn,
> .name = name,
> .user_worker = 1,
> .no_files = 1,
> - .ignore_signals = 1,
> + .block_signals = 1,
> };
> struct vhost_task *vtsk;
> struct task_struct *tsk;
Eric
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-19 4:16 ` Eric W. Biederman
@ 2023-05-19 23:24 ` Mike Christie
0 siblings, 0 replies; 28+ messages in thread
From: Mike Christie @ 2023-05-19 23:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: axboe, brauner, mst, linux-kernel, Oleg Nesterov, stefanha, linux,
nicolas.dichtel, virtualization, torvalds
On 5/18/23 11:16 PM, Eric W. Biederman wrote:
> Mike Christie <michael.christie@oracle.com> writes:
>
>> On 5/18/23 1:28 PM, Eric W. Biederman wrote:
>>> Still the big issue seems to be the way get_signal is connected into
>>> these threads so that it keeps getting called. Calling get_signal after
>>> a fatal signal has been returned happens nowhere else, and even if we fix
>>> it today it is likely to lead to bugs in the future, because whoever is
>>> testing and updating the code is unlikely to have a vhost test case
>>> they care about.
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 8f6330f0e9ca..4d54718cad36 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>>
>>> void recalc_sigpending(void)
>>> {
>>> - if (!recalc_sigpending_tsk(current) && !freezing(current))
>>> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>>> + ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>>> + !__fatal_signal_pending(current)))
>>> clear_thread_flag(TIF_SIGPENDING);
>>>
>>> }
>>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>>> * This signal will be fatal to the whole group.
>>> */
>>> if (!sig_kernel_coredump(sig)) {
>>> + /*
>>> + * The signal is being short circuit delivered;
>>> + * don't leave it pending.
>>> + */
>>> + if (type != PIDTYPE_PID) {
>>> + sigdelset(&t->signal->shared_pending.signal, sig);
>>> +
>>> /*
>>> * Start a group exit and wake everybody up.
>>> * This way we don't have other threads
>>>
>>
>> If I change up your patch so the last part is moved down a bit to when we set t
>> like this:
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 0ac48c96ab04..c976a80650db 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>
>> void recalc_sigpending(void)
>> {
>> - if (!recalc_sigpending_tsk(current) && !freezing(current))
>> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>> + ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>> + !__fatal_signal_pending(current)))
>> clear_thread_flag(TIF_SIGPENDING);
>> -
> Can we get rid of this suggested change to recalc_sigpending? The more I look
> at it the more I am convinced it is not safe. In particular I believe
> it is incompatible with dump_interrupted() in fs/coredump.c.
With your suggested clear_thread_flag call in vhost_worker I don't need
the above chunk.
>
> The code in fs/coredump.c is the closest code we have to what you are
> trying to do with vhost_worker after the session is killed. It also
> struggles with TIF_SIGPENDING getting set.
>> }
>> EXPORT_SYMBOL(recalc_sigpending);
>>
>> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
>> signal->group_exit_code = sig;
>> signal->group_stop_count = 0;
>> t = p;
>> + /*
>> + * The signal is being short circuit delivered;
>> + * don't leave it pending.
>> + */
>> + if (type != PIDTYPE_PID) {
>> + struct sigpending *pending;
>> +
>> + pending = &t->signal->shared_pending;
>> + sigdelset(&pending->signal, sig);
>> + }
>> +
>> do {
>> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>> sigaddset(&t->pending.signal, SIGKILL);
>>
>>
>> Then get_signal() works the way Oleg mentioned it should earlier.
>
> I am puzzled that it makes a difference, as t->signal and p->signal should
> point to the same thing, and in fact the code would read more clearly as
> sigdelset(&signal->shared_pending.signal, sig);
Yeah, either should work. The original patch had used t before it was
set, so my patch just moved the chunk down to after we set it. I just used signal
like you wrote and it works fine.
>
> But all of that seems minor.
>
>> For vhost I just need the code below which is just Linus's patch plus a call
>> to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.
>>
>> Note that when we get SIGKILL, the vhost file_operations->release function is called via
>>
>> do_exit -> exit_files -> put_files_struct -> close_files
>>
>> and so the vhost release function starts to flush IO and stop the worker/vhost
>> task. In vhost_worker() we then just handle the last completions for already
>> running IO. When the vhost release function detects they are done, it calls
>> vhost_task_stop(); vhost_worker() returns, and then vhost_task_fn() does do_exit().
>> So we don't return immediately when get_signal() returns non-zero.
>>
>> So it works, but it sounds like you don't like vhost relying on the behavior,
>> and it's non-standard to use get_signal() like we are. So I'm not sure how we
>> want to proceed.
>
> Let me clarify my concern.
>
> Your code modifies get_signal as:
> /*
> - * PF_IO_WORKER threads will catch and exit on fatal signals
> + * PF_USER_WORKER threads will catch and exit on fatal signals
> * themselves. They have cleanup that must be performed, so
> * we cannot call do_exit() on their behalf.
> */
> - if (current->flags & PF_IO_WORKER)
> + if (current->flags & PF_USER_WORKER)
> goto out;
> /*
> * Death signals, no core dump.
> */
> do_group_exit(ksig->info.si_signo);
> /* NOTREACHED */
>
> Which means by modifying get_signal you are logically deleting the
> do_group_exit from get_signal. As far as that goes that is a perfectly
> reasonable change. The problem is you wind up calling get_signal again
> after that. That does not make sense.
>
> I would suggest doing something like:
I see. I've run some tests today with what you suggested for vhost_worker
and your signal change, and it works for SIGKILL/STOP/CONT and freeze.
>
> What is the diff below? It does not appear to be a revert diff.
It was just the simplest patch that was needed, with your signal changes
(and the PF_IO_WORKER -> PF_USER_WORKER signal change), to fix the 2
regressions reported. I wanted to give the vhost devs an idea of what was
needed with your signal changes.
Let me do some more testing over the weekend and I'll post an RFC with your
signal change and the minimal changes needed to vhost to handle the 2
regressions that were reported. The vhost developers can get a better idea
of what needs to be done and they can better decide what they want to do to
proceed.
* Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
2023-05-18 18:28 ` Eric W. Biederman
2023-05-18 22:57 ` Mike Christie
@ 2023-05-22 13:30 ` Oleg Nesterov
1 sibling, 0 replies; 28+ messages in thread
From: Oleg Nesterov @ 2023-05-22 13:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: axboe, brauner, mst, linux, linux-kernel, stefanha,
nicolas.dichtel, virtualization, torvalds
On 05/18, Eric W. Biederman wrote:
>
> void recalc_sigpending(void)
> {
> - if (!recalc_sigpending_tsk(current) && !freezing(current))
> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> + ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> + !__fatal_signal_pending(current)))
> clear_thread_flag(TIF_SIGPENDING);
>
> }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
> * This signal will be fatal to the whole group.
> */
> if (!sig_kernel_coredump(sig)) {
> + /*
> + * The signal is being short circuit delivered;
> + * don't leave it pending.
> + */
> + if (type != PIDTYPE_PID) {
> + sigdelset(&t->signal->shared_pending.signal, sig);
> +
> /*
> * Start a group exit and wake everybody up.
> * This way we don't have other threads
Eric, sorry. I fail to understand this patch.
How can it help? And whom does it help?
Perhaps we can discuss it in the context of the new series from Mike?
Oleg.
end of thread, other threads: [~2023-05-22 13:31 UTC | newest]
Thread overview: 28+ messages; the index below lists each message in this thread.
2023-05-18 0:09 [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND Mike Christie
2023-05-18 0:09 ` [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set Mike Christie
2023-05-18 2:34 ` Eric W. Biederman
2023-05-18 3:49 ` Eric W. Biederman
2023-05-18 15:21 ` Mike Christie
2023-05-18 16:25 ` Oleg Nesterov
2023-05-18 16:42 ` Mike Christie
2023-05-18 17:04 ` Oleg Nesterov
2023-05-18 18:28 ` Eric W. Biederman
2023-05-18 22:57 ` Mike Christie
2023-05-19 4:16 ` Eric W. Biederman
2023-05-19 23:24 ` Mike Christie
2023-05-22 13:30 ` Oleg Nesterov
[not found] ` <20230518-kontakt-geduckt-25bab595f503@brauner>
2023-05-18 15:27 ` Mike Christie
[not found] ` <20230518-ratgeber-erbeben-843e68b0d6ac@brauner>
2023-05-18 18:08 ` Oleg Nesterov
[not found] ` <20230518-fettgehalt-erdbeben-25587a432815@brauner>
2023-05-18 18:23 ` Oleg Nesterov
2023-05-18 0:09 ` [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler Mike Christie
2023-05-18 0:16 ` Linus Torvalds
2023-05-18 1:01 ` Mike Christie
2023-05-18 0:09 ` [RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND Mike Christie
2023-05-18 0:09 ` [RFC PATCH 4/8] vhost-net: Move vhost_net_open Mike Christie
2023-05-18 0:09 ` [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones Mike Christie
[not found] ` <20230518-lokomotive-aufziehen-dbc432136b76@brauner>
2023-05-18 15:03 ` Mike Christie
2023-05-18 18:38 ` Eric W. Biederman
2023-05-18 0:09 ` [RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works Mike Christie
2023-05-18 0:09 ` [RFC PATCH 7/8] vhost-net: " Mike Christie
2023-05-18 0:09 ` [RFC PATCH 8/8] fork/vhost_task: remove no_files Mike Christie
2023-05-18 1:04 ` Mike Christie