From: "Michael S. Tsirkin" <mst@redhat.com>
To: Mike Christie <michael.christie@oracle.com>
Cc: Jason Wang <jasowang@redhat.com>,
	oleg@redhat.com, ebiederm@xmission.com,
	virtualization@lists.linux-foundation.org, sgarzare@redhat.com,
	stefanha@redhat.com, brauner@kernel.org,
	Andreas Karis <akaris@redhat.com>,
	Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting
Date: Thu, 18 Apr 2024 03:01:03 -0400
Message-ID: <20240417150923-mutt-send-email-mst@kernel.org>
In-Reply-To: <edc792c4-30c9-4065-bf09-657bd7766d04@oracle.com>

On Wed, Apr 17, 2024 at 11:03:07AM -0500, Mike Christie wrote:
> On 4/16/24 10:50 PM, Jason Wang wrote:
> > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@oracle.com> wrote:
> >>>
> >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> >>>> <michael.christie@oracle.com> wrote:
> >>>>>
> >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> >>>>>> <michael.christie@oracle.com> wrote:
> >>>>>>>
> >>>>>>> The following patches were made over Linus's tree and also apply over
> >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> >>>>>>> handle SIGKILL by flushing queued works, stopping new works from being
> >>>>>>> queued, and preparing the task for an early exit.
> >>>>>>>
> >>>>>>> This removes the need for the signal/coredump hacks added in:
> >>>>>>>
> >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >>>>>>>
> >>>>>>> when the vhost_task patches were initially merged, and fixes the issue
> >>>>>>> in this thread:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@google.com/
> >>>>>>>
> >>>>>>> Long Background:
> >>>>>>>
> >>>>>>> The original vhost worker code didn't support any signals. If the
> >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> >>>>>>> process would exit, dropping all references to the device, and then the
> >>>>>>> file operation's release function would be called. From there we would
> >>>>>>> wait on running IO and then clean up the device's memory.
> >>>>>>
> >>>>>> A dumb question.
> >>>>>>
> >>>>>> Is this a change noticeable by user space? For example, with this series
> >>>>>> a SIGKILL may shut down the datapath ...
> >>>>>
> >>>>> It already changed in 6.4. We basically added a new interface to shut down
> >>>>> everything (userspace and vhost kernel parts). So we won't just shut down
> >>>>> the data path while userspace is still running. We will shut down everything
> >>>>> now if you send a SIGKILL to a vhost worker's thread.
> >>>>
> >>>> If I understand correctly, for example Qemu can still live if SIGKILL
> >>>> is just sent to the vhost thread.
> >>>
> >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> >>> We used kthreads which are special and can ignore it like how userspace
> >>> can ignore SIGHUP.
> >>>
> >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> >>> ignores it, as I described below, the signal is still delivered
> >>> to the other qemu threads due to the shared signal handler. Userspace
> >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> >>> kernel forces the threads to exit.
> >>
> >> Ok, I see, so the reason is that vhost belongs to the same thread
> >> group as the owner now.
> >>
> >>>
> >>>>
> >>>> If this is correct, guests may detect this (for example virtio-net has
> >>>> a watchdog).
> >>>>
> >>>
> >>> What did you mean by that part? Do you mean if the vhost thread were to
> >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> >>> the guest (virtio-net driver in the guest kernel) would detect that?
> >>
> >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> >>
> >>> Or
> >>> are you saying the watchdog in the guest can detect signals that the
> >>> host gets?
> >>>
> >>>
> >>>>>
> >>>>> Here are a lot of details:
> >>>>>
> >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> >>>>> to a vhost worker, we ignored it. Nothing happened. kthreads are special and
> >>>>> can ignore all signals.
> >>>>>
> >>>>> You could think of it as the worker being a completely different process from
> >>>>> qemu/userspace, so they have completely different signal handlers. The
> >>>>> vhost worker signal handler ignores all signals, even SIGKILL.
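> >>>>>
> >>>>> Roughly like this, as a minimal sketch of a generic kthread (hypothetical
> >>>>> example code, not the actual vhost worker; allow_signal() is the opt-in a
> >>>>> kthread would need to even see a signal):
> >>>>>
> >>>>> #include <linux/kthread.h>
> >>>>> #include <linux/sched.h>
> >>>>>
> >>>>> static int example_worker_fn(void *data)
> >>>>> {
> >>>>> 	/*
> >>>>> 	 * All signals are ignored here by default. Without an explicit
> >>>>> 	 * allow_signal(SIGKILL), a SIGKILL aimed at this thread is
> >>>>> 	 * simply dropped and the loop keeps running.
> >>>>> 	 */
> >>>>> 	while (!kthread_should_stop()) {
> >>>>> 		set_current_state(TASK_INTERRUPTIBLE);
> >>>>> 		schedule();
> >>>>> 	}
> >>>>> 	__set_current_state(TASK_RUNNING);
> >>>>> 	return 0;
> >>>>> }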
> >>>>
> >>>> Yes.
> >>>>
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> >>>>> don't get to do an explicit close() on the vhost device and we don't get
> >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> >>>>> the vhost worker.
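> >>>>>
> >>>>> As a minimal sketch of that release path (a hypothetical handler, not the
> >>>>> exact drivers/vhost code):
> >>>>>
> >>>>> static int example_vhost_release(struct inode *inode, struct file *f)
> >>>>> {
> >>>>> 	struct vhost_dev *dev = f->private_data;
> >>>>>
> >>>>> 	/* wait on in-flight IO, then stop and free the worker */
> >>>>> 	vhost_dev_stop(dev);
> >>>>> 	vhost_dev_cleanup(dev);
> >>>>> 	kfree(dev);
> >>>>> 	return 0;
> >>>>> }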
> >>>>
> >>>> Right.
> >>>>
> >>>>>
> >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> >>>>> can be thought of as a thread within the userspace process. With that
> >>>>> change we have the same signal handler as the userspace process.
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> >>>>>
> >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> >>>>> then just continue to process works until file_operations->release
> >>>>> is called and stops the worker.
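> >>>>>
> >>>>> In rough pseudo-kernel-C, the idea looks like this (a sketch only, not
> >>>>> the literal vhost_task_fn; the names below are simplified):
> >>>>>
> >>>>> static int vhost_task_fn_sketch(void *data)
> >>>>> {
> >>>>> 	struct vhost_task *vtsk = data;
> >>>>> 	bool dead = false;
> >>>>>
> >>>>> 	for (;;) {
> >>>>> 		if (!dead && signal_pending(current)) {
> >>>>> 			/*
> >>>>> 			 * SIGKILL arrived, but don't exit yet: keep
> >>>>> 			 * running works until release() stops us.
> >>>>> 			 */
> >>>>> 			dead = true;
> >>>>> 		}
> >>>>> 		if (vhost_task_should_stop(vtsk))
> >>>>> 			break;
> >>>>> 		vtsk->fn(vtsk->data);	/* process queued works */
> >>>>> 	}
> >>>>> 	do_exit(0);
> >>>>> }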
> >>>>
> >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> >>>
> >>> Not exactly. The vhost_task stays alive temporarily.
> >>>
> >>> The signal is still delivered to the userspace threads and they will
> >>> also exit due to getting the SIGKILL. SIGKILL goes to all the threads in
> >>> the process, and all userspace threads exit like normal because the vhost
> >>> task and normal old userspace threads share a signal handler. When
> >>> userspace exits, the kernel force-drops the refcounts on the vhost
> >>> devices, and that runs the release function, so the vhost_task will then exit.
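> >>>
> >>> You can see that shared thread-group behavior from plain userspace; a
> >>> minimal demo (nothing vhost-specific, just pthreads in one thread group):
> >>>
> >>> #include <pthread.h>
> >>> #include <signal.h>
> >>> #include <stdio.h>
> >>> #include <unistd.h>
> >>> #include <sys/syscall.h>
> >>>
> >>> static void *worker(void *arg)
> >>> {
> >>> 	/* aim SIGKILL at just this one thread's tid ... */
> >>> 	syscall(SYS_tgkill, getpid(), (pid_t)syscall(SYS_gettid), SIGKILL);
> >>> 	pause();
> >>> 	return NULL;
> >>> }
> >>>
> >>> int main(void)
> >>> {
> >>> 	pthread_t t;
> >>>
> >>> 	pthread_create(&t, NULL, worker, NULL);
> >>> 	pthread_join(t, NULL);
> >>> 	/* ... but SIGKILL is fatal to the whole group, so this never runs */
> >>> 	printf("unreachable\n");
> >>> 	return 0;
> >>> }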
> >>>
> >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> >>
> >> Yes. To tell the truth, it looks even worse, but it might be too late to fix.
> > 
> > Andreas (cced) has identified two other possible changes:
> > 
> > 1) doesn't run in the global PID namespace but runs in the namespace of the owner
> 
> Yeah, I mentioned that one in vhost.h like it's a feature, and when posting
> the patches I mentioned it as a possible fix. I mean, I thought we wanted it
> to work like qemu and iothreads, where the iothread would inherit all those
> values automatically.
> 
> At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> because there was no kernel function for it (like how we didn't inherit v2
> cgroups until recently when someone added some code for that).
> 
> I don't know if it's allowed to have something like qemu in namespace N but then
> have its children (the vhost thread in this case) in the global namespace. I'll
> look into it.

Yeah, a big if.

> > 2) doesn't inherit kthreadd's scheduling attributes but the owner's
> 
> Same as above for this one. I thought I was fixing a bug where before
> we had to manually tune the vhost thread's values, while iothreads got
> set up automatically.
> 
> Just to clarify this one. When we used kthreads, kthread() would reset the
> scheduler priority for the kthread that's created, so we got the default
> values instead of inheriting kthreadd's values. So we would want:
> 
> +	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> 
> in vhost_task_fn() instead of inheriting kthreadd's values.
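>
> A slightly fuller sketch of that hunk (this mirrors what kthread() itself
> does to reset the priority; param is the piece elided above):
>
> 	struct sched_param param = { .sched_priority = 0 };
>
> 	/* drop any inherited policy/priority back to the defaults */
> 	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);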
> 
> > 
> > Though such a change makes more sense for some use cases, it may break others.
> > 
> > I wonder if we need to introduce a new flag and bring back the old kthread
> 
> Do you mean something like a module param?
> 
> > code if the flag is not set? Then we would not end up trying to align
> > the behaviour.
> >
> 
> Let me know what you guys prefer. The sched part is easy. The namespace
> part might be more difficult, but I will look into it if you want it.

