From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: Andreas Karis <akaris@redhat.com>,
Mike Christie <michael.christie@oracle.com>,
oleg@redhat.com, ebiederm@xmission.com,
virtualization@lists.linux-foundation.org, sgarzare@redhat.com,
stefanha@redhat.com, brauner@kernel.org,
Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
Date: Wed, 15 May 2024 03:24:12 -0400 [thread overview]
Message-ID: <20240515032050-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <CACGkMEs-urygu=h4_Y75ojwCKvRXH39fr-HpSTHchT8eL7issA@mail.gmail.com>
On Wed, May 15, 2024 at 02:27:32PM +0800, Jason Wang wrote:
> On Fri, Apr 19, 2024 at 8:40 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Apr 19, 2024 at 8:37 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 5:26 PM Andreas Karis <akaris@redhat.com> wrote:
> > > >
> > > > Not a kernel developer, so sorry if this here is pointless spam.
> > > >
> > > > But qemu is not the only thing consuming /dev/vhost-net:
> > > > https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html
> > > >
> > > > Before the series of patches around
> > > > https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
> > > > you would run a DPDK application inside namespaces of container
> > > > "blue", but then the vhost-... threads were spawned as children of
> > > > kthreadd and ran in the global namespaces. That seemed
> > > > counterintuitive, and what's worse, the DPDK application running inside the
> > > > "blue" container namespaces cannot see and thus cannot modify
> > > > scheduling attributes and affinity of its own vhost-... threads. In
> > > > scenarios with pretty strict isolation/pinning requirements, the
> > > > vhost-... threads could easily be scheduled on the same CPUs as the
> > > > DPDK poll mode drivers (because they inherit the calling process' CPU
> > > > affinity), and there's no way for the DPDK process itself to move the
> > > > vhost-... threads to other CPUs. Also, if the DPDK process ran as
> > > > SCHED_RR, the vhost-... thread would still be SCHED_NORMAL.
> > > >
> > > > After the patch series, the fact that the vhost-... threads run as
> > > > tasks of the process that requests them seems more natural and gives
> > > > us as users the kind of control that we'd want from within the
> > > > container to modify the vhost-... threads' scheduling and affinity.
> > > > The vhost-... thread as a child of the DPDK application inherits the
> > > > same scheduling class, CPU set, etc., and the DPDK process can easily
> > > > change those attributes.
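
To make this concrete with the 6.4+ behaviour: since the worker is now a
thread of the owner, the vhost-... workers show up under /proc/self/task
of the owning process, and the owner can re-pin them itself. Rough,
untested sketch - the pin_vhost_workers() helper and the CPU number are
made up for illustration:

#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Move every thread of this process whose comm starts with "vhost-"
 * onto the given CPU.  Only works with vhost_tasks (6.4+), where the
 * worker is a thread of the owner; a pre-6.4 kthread worker is not
 * visible under /proc/self/task at all. */
static void pin_vhost_workers(int cpu)
{
        DIR *d = opendir("/proc/self/task");
        struct dirent *de;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (!d)
                return;

        while ((de = readdir(d))) {
                char path[288], comm[32] = "";
                FILE *f;

                if (de->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path), "/proc/self/task/%s/comm",
                         de->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fgets(comm, sizeof(comm), f) &&
                    !strncmp(comm, "vhost-", 6))
                        sched_setaffinity(atoi(de->d_name), sizeof(set), &set);
                fclose(f);
        }
        closedir(d);
}

int main(void)
{
        pin_vhost_workers(0);   /* e.g. keep vhost workers off the PMD cores */
        return 0;
}

sched_setscheduler() on those tids works the same way if you want the
workers out of (or back into) SCHED_NORMAL.
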
> > > >
> > > > However, if another user was used to the old behavior, and their
> > > > entire tooling was created for the old behavior (imagine someone wrote
> > > > a bunch of scripts/services to force affinity from the global PID
> > > > namespace for those vhost-... threads), and now this change was
> > > > introduced, it would break their tooling. So because _I_ might fancy
> > > > the new behavior, but user _B_ might be all set up for the old
> > > > kthreadd, shouldn't there be a flag or configuration for the user to:
> > > >
> > > > if (new_flag)
> > > > vhost_task_create()
> > > > else
> > > > kthread_create()
> > > >
> > > > I don't know what kind of flag or knob I'd expect here, but it could
> > > > be granular for the calling process (a new syscall?), or a kernel
> > > > flag, etc. But something that lets the admin choose how the kernel
> > > > spawns the vhost-... threads?
> > >
> > > A flag via uAPI that could be controlled by the user of vhost-net via
> > > syscall. For example, it could be done via set/get_backend_features()
> > > syscall.
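
Roughly what that would look like from the vhost-net user's side, for the
record. The VHOST_GET/SET_BACKEND_FEATURES ioctls already exist; the
VHOST_BACKEND_F_KTHREAD bit below is made up purely for illustration
(nothing like it has been merged), and a real flag would presumably have
to be negotiated before VHOST_SET_OWNER, since that is where the worker
actually gets created:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

/* Hypothetical flag, for illustration only -- not in any kernel. */
#define VHOST_BACKEND_F_KTHREAD (1ULL << 63)

int main(void)
{
        uint64_t features = 0;
        int fd = open("/dev/vhost-net", O_RDWR);

        if (fd < 0) {
                perror("open /dev/vhost-net");
                return 1;
        }
        if (ioctl(fd, VHOST_GET_BACKEND_FEATURES, &features) < 0) {
                perror("VHOST_GET_BACKEND_FEATURES");
                return 1;
        }
        if (features & VHOST_BACKEND_F_KTHREAD) {
                /* Ask for the pre-6.4 kthread-based worker. */
                features = VHOST_BACKEND_F_KTHREAD;
                if (ioctl(fd, VHOST_SET_BACKEND_FEATURES, &features) < 0)
                        perror("VHOST_SET_BACKEND_FEATURES");
        }
        close(fd);
        return 0;
}
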
> >
> > So it can be controlled by the admin if we want.
> >
> > Thanks
>
> Michael, does the flag make sense or not?
>
> Thanks
So you want a flag to bring back the pre-6.4 behaviour?
Given the behaviour has been in an upstream kernel already,
I'd be inclined to say let's wait until there's an actual
report of fallout, and then look for work-arounds.
Otherwise we don't really know if we are even fixing anything.
> >
> > >
> > > Thanks
> > >
> > > >
> > > > On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > > > > > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > > > > > <michael.christie@oracle.com> wrote:
> > > > > > >
> > > > > > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > > > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >>
> > > > > > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> > > > > > > >>>
> > > > > > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > > > > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > > > > > >>>> <michael.christie@oracle.com> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > > > > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > > > > > >>>>>> <michael.christie@oracle.com> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > > > > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > > > > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > > > > > >>>>>>> queued, and prepare the task for an early exit.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > > > > > >>>>>>> in this thread:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Long Background:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > > > > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > > > > > >>>>>>> process would exit dropping all references to the device and then the
> > > > > > > >>>>>>> file operation's release function would be called. From there we would
> > > > > > > >>>>>>> wait on running IO and then clean up the device's memory.
> > > > > > > >>>>>>
> > > > > > > >>>>>> A dumb question.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Is this a change that user space can notice? For example, with this series
> > > > > > > >>>>>> a SIGKILL may shut down the datapath ...
> > > > > > > >>>>>
> > > > > > > >>>>> It already changed in 6.4. We basically added a new interface to shut down
> > > > > > > >>>>> everything (userspace and vhost kernel parts). So we won't just shut down
> > > > > > > >>>>> the data path while userspace is still running. We will now shut down
> > > > > > > >>>>> everything if you send a SIGKILL to a vhost worker's thread.
> > > > > > > >>>>
> > > > > > > >>>> If I understand correctly, for example Qemu can still live if SIGKILL
> > > > > > > >>>> is just sent to the vhost thread.
> > > > > > > >>>
> > > > > > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > > > > > >>> We used kthreads which are special and can ignore it like how userspace
> > > > > > > >>> can ignore SIGHUP.
> > > > > > > >>>
> > > > > > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > > > > > >>> ignores it like I described below, the signal is still delivered
> > > > > > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > > > > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > > > > > >>> kernel forces them to exit.
> > > > > > > >>
> > > > > > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > > > > > >> group as the owner now.
> > > > > > > >>
> > > > > > > >>>
> > > > > > > >>>>
> > > > > > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > > > > > >>>> a watchdog).
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > > > > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > > > > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > > > > > >>
> > > > > > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > > > > > >>
> > > > > > > >>> Or
> > > > > > > >>> are you saying the watchdog in the guest can detect signals that the
> > > > > > > >>> host gets?
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>>>
> > > > > > > >>>>> Here are a lots of details:
> > > > > > > >>>>>
> > > > > > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > > > > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > > > > > >>>>> can ignore all signals.
> > > > > > > >>>>>
> > > > > > > >>>>> You could think of it as the worker is a completely different process than
> > > > > > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > > > > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > > > > > >>>>
> > > > > > > >>>> Yes.
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > > > > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > > > > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > > > > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > > > > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > > > > > >>>>> the vhost worker.
> > > > > > > >>>>
> > > > > > > >>>> Right.
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > > > > > >>>>> can be thought of as a thread within the userspace process. With that
> > > > > > > >>>>> change we have the same signal handler as the userspace process.
> > > > > > > >>>>>
> > > > > > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > > > > > >>>>>
> > > > > > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > > > > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > > > > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > > > > > >>>>> then just continue to process works until file_operations->release
> > > > > > > >>>>> calls
> > > > > > > >>>>
> > > > > > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > > > > > >>>
> > > > > > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > > > > > >>>
> > > > > > > >>> The signal is still delivered to the userspace threads and they will
> > > > > > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > > > > > >>> the process and all userspace threads exit like normal because the vhost
> > > > > > > >>> task and normal old userspace threads share a signal handler. When
> > > > > > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > > > > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > > > > > >>>
> > > > > > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
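
And to be clear, that part is not vhost specific at all: a thread-directed
SIGKILL always takes the whole thread group down, which is easy to see
from plain userspace. Trivial (untested) demo:

#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t worker_tid;

static void *worker(void *arg)
{
        worker_tid = syscall(SYS_gettid);
        for (;;)
                pause();        /* idle, like a worker with no queued work */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);
        sleep(1);               /* crude: let the worker publish its tid */

        /* SIGKILL aimed at just the worker thread... */
        syscall(SYS_tgkill, getpid(), worker_tid, SIGKILL);

        /* ...still kills the whole thread group, so this never runs. */
        sleep(1);
        printf("still alive\n");
        return 0;
}
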
> > > > > > > >>
> > > > > > > >> Yes. To tell the truth, it looks even worse, but it might be too late to fix.
> > > > > > > >
> > > > > > > > Andreas (cced) has identified two other possible changes:
> > > > > > > >
> > > > > > > > 1) doesn't run in the global PID namespace but runs in the namespace of the owner
> > > > > > >
> > > > > > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > > > > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > > > > > to work like qemu and iothreads where the iothread would inherit all those
> > > > > > > values automatically.
> > > > > >
> > > > > > Right, but it could be noticed by userspace, especially by users
> > > > > > that try to tweak performance.
> > > > > >
> > > > > > The root cause is that now we do copy_process() in the context of
> > > > > > Qemu instead of kthreadd, which results in differences in namespaces
> > > > > > (I think the PID namespace is not the only one where we see a
> > > > > > difference) and other attributes for the vhost task.
> > > > >
> > > > > Leaking things out of a namespace looks more like a bug.
> > > > > If you really have to be pedantic, the thing to add would
> > > > > be a namespace flag not a qemu flag. Userspace running inside
> > > > > a namespace really must have no say about whether to leak
> > > > > info out of it.
> > > > >
> > > > > > >
> > > > > > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > > > > > because there was no kernel function for it (like how we didn't inherit v2
> > > > > > > cgroups until recently when someone added some code for that).
> > > > > > >
> > > > > > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > > > > > have its children (the vhost thread in this case) in the global namespace.
> > > > > > > I'll
> > > > > > > look into it.
> > > > > >
> > > > > > Instead of moving the vhost thread between different namespaces, I wonder
> > > > > > if the following is simpler:
> > > > > >
> > > > > > if (new_flag)
> > > > > > vhost_task_create()
> > > > > > else
> > > > > > kthread_create()
> > > > > >
> > > > > > The new flag inherits the attributes of Qemu (namespaces, rlimit, cgroup,
> > > > > > scheduling attributes ...) which is what we want. Without the new
> > > > > > flag, we stick exactly to the behaviour as in the past to unbreak
> > > > > > existing userspace.
> > > > > >
> > > > > > >
> > > > > > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner's
> > > > > > >
> > > > > > > Same as above for this one. I thought I was fixing a bug where before
> > > > > > > we had to manually tune the vhost thread's values but for iothreads they
> > > > > > > automatically got set up.
> > > > > > >
> > > > > > > Just to clarify this one. When we used kthreads, kthread() would reset the
> > > > > > > scheduler priority for the kthread that's created, so we got the default
> > > > > > > values instead of inheriting kthreadd's values. So we would want:
> > > > > > >
> > > > > > > + sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > > > > > >
> > > > > > > in vhost_task_fn() instead of inheriting kthreadd's values.
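
The inheritance half of this is also easy to see from userspace: a
CLONE_THREAD child (here a plain pthread with default attributes) keeps
the creator's policy, which is what the vhost_task does now, whereas
kthread() reset the policy to SCHED_NORMAL. Small demo, untested, and the
SCHED_RR part needs CAP_SYS_NICE:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *child(void *arg)
{
        /* With default pthread attributes the policy is inherited, so
         * this prints SCHED_RR (2) if the setscheduler below worked. */
        printf("child policy: %d\n", sched_getscheduler(0));
        return NULL;
}

int main(void)
{
        struct sched_param p = { .sched_priority = 1 };
        pthread_t t;

        if (sched_setscheduler(0, SCHED_RR, &p))
                perror("sched_setscheduler");

        pthread_create(&t, NULL, child, NULL);
        pthread_join(t, NULL);
        return 0;
}
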
> > > > > > >
> > > > > > > >
> > > > > > > > Though such a change makes more sense for some use cases, it may break others.
> > > > > > > >
> > > > > > > > I wonder if we need to introduce a new flag and bring back the old kthread
> > > > > > >
> > > > > > > Do you mean something like a module param?
> > > > > >
> > > > > > This requires the management layer to know if it has a new user space
> > > > > > or not, which is hard. A better place is to introduce backend features.
> > > > > >
> > > > > > >
> > > > > > > > code if the flag is not set? Then we would not end up trying to align
> > > > > > > > the behaviour?
> > > > > > > >
> > > > > > >
> > > > > > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > > > > > part might be more difficult, but I will look into it if you want it.
> > > > > >
> > > > > > Thanks a lot. I think it would be better to have the namespace part
> > > > > > (as well as other namespaces), so we don't need to answer hard
> > > > > > questions like if it can break user space or not.
> > > > > >
> > > > > > >
> > > > >
> > > >
Thread overview: 32+ messages
2024-03-16 0:46 [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Mike Christie
2024-03-16 0:46 ` [PATCH 1/9] vhost-scsi: Handle vhost_vq_work_queue failures for events Mike Christie
2024-03-16 0:47 ` [PATCH 2/9] vhost-scsi: Handle vhost_vq_work_queue failures for cmds Mike Christie
2024-03-16 0:47 ` [PATCH 3/9] vhost-scsi: Use system wq to flush dev for TMFs Mike Christie
2024-03-16 0:47 ` [PATCH 4/9] vhost: Remove vhost_vq_flush Mike Christie
2024-03-16 0:47 ` [PATCH 5/9] vhost_scsi: Handle vhost_vq_work_queue failures for TMFs Mike Christie
2024-03-16 0:47 ` [PATCH 6/9] vhost: Use virtqueue mutex for swapping worker Mike Christie
2024-03-16 0:47 ` [PATCH 7/9] vhost: Release worker mutex during flushes Mike Christie
2024-03-16 0:47 ` [PATCH 8/9] vhost_task: Handle SIGKILL by flushing work and exiting Mike Christie
2024-03-16 0:47 ` [PATCH 9/9] kernel: Remove signal hacks for vhost_tasks Mike Christie
2024-04-09 4:16 ` [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting Jason Wang
2024-04-09 14:57 ` Mike Christie
2024-04-09 16:40 ` Michael S. Tsirkin
2024-04-09 21:55 ` michael.christie
2024-04-10 4:21 ` Michael S. Tsirkin
2024-04-18 7:10 ` Michael S. Tsirkin
2024-04-11 8:39 ` Jason Wang
2024-04-11 16:19 ` Mike Christie
2024-04-12 3:28 ` Jason Wang
2024-04-12 16:52 ` michael.christie
2024-04-15 8:52 ` Jason Wang
2024-04-17 3:50 ` Jason Wang
2024-04-17 16:03 ` Mike Christie
2024-04-18 4:08 ` Jason Wang
2024-04-18 7:07 ` Michael S. Tsirkin
2024-04-18 9:25 ` Andreas Karis
2024-04-19 0:37 ` Jason Wang
2024-04-19 0:40 ` Jason Wang
2024-05-15 6:27 ` Jason Wang
2024-05-15 7:24 ` Michael S. Tsirkin [this message]
2024-04-19 0:33 ` Jason Wang
2024-04-18 7:01 ` Michael S. Tsirkin