cgroups.vger.kernel.org archive mirror
* cgroup2 freezer and kvm_vm_worker_thread()
@ 2024-10-29  0:07 Tejun Heo
  2024-10-29 20:46 ` Roman Gushchin
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Tejun Heo @ 2024-10-29  0:07 UTC (permalink / raw)
  To: Paolo Bonzini, Luca Boccassi, Roman Gushchin
  Cc: kvm, cgroups, Michal Koutný, linux-kernel

Hello,

Luca is reporting that cgroups which have kvm instances inside never
complete freezing. This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 --nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0
  root@test ~# for i in (cat /sys/fs/cgroup/test/cgroup.threads); echo $i; cat /proc/$i/stack; end 
  2070
  [<0>] do_freezer_trap+0x42/0x70
  [<0>] get_signal+0x4da/0x870
  [<0>] arch_do_signal_or_restart+0x1a/0x1c0
  [<0>] syscall_exit_to_user_mode+0x73/0x120
  [<0>] do_syscall_64+0x87/0x140
  [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  2159
  [<0>] do_freezer_trap+0x42/0x70
  [<0>] get_signal+0x4da/0x870
  [<0>] arch_do_signal_or_restart+0x1a/0x1c0
  [<0>] syscall_exit_to_user_mode+0x73/0x120
  [<0>] do_syscall_64+0x87/0x140
  [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  2160
  [<0>] do_freezer_trap+0x42/0x70
  [<0>] get_signal+0x4da/0x870
  [<0>] arch_do_signal_or_restart+0x1a/0x1c0
  [<0>] syscall_exit_to_user_mode+0x73/0x120
  [<0>] do_syscall_64+0x87/0x140
  [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  2161
  [<0>] kvm_nx_huge_page_recovery_worker+0xea/0x680
  [<0>] kvm_vm_worker_thread+0x8f/0x2b0
  [<0>] kthread+0xe8/0x110
  [<0>] ret_from_fork+0x33/0x40
  [<0>] ret_from_fork_asm+0x1a/0x30
  2164
  [<0>] do_freezer_trap+0x42/0x70
  [<0>] get_signal+0x4da/0x870
  [<0>] arch_do_signal_or_restart+0x1a/0x1c0
  [<0>] syscall_exit_to_user_mode+0x73/0x120
  [<0>] do_syscall_64+0x87/0x140
  [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

The cgroup freezing happens in the signal delivery path but
kvm_vm_worker_thread() threads never call into the signal delivery path
while joining non-root cgroups, so they never get frozen. Because the cgroup
freezer determines whether a given cgroup is frozen by comparing the number
of frozen threads to the total number of threads in the cgroup, the cgroup
never becomes frozen and users waiting for the state transition may hang
indefinitely.
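
For reference, the frozen check boils down to a bare thread-count
comparison, so a single thread that never freezes pins the whole cgroup in
the unfrozen state. A simplified sketch, modeled on
kernel/cgroup/freezer.c (field names approximate):

  /*
   * A cgroup is reported frozen only when freezing was requested and
   * every thread in it has parked itself in do_freezer_trap(). The kvm
   * worker counts toward the task count but never increments the frozen
   * count, so the comparison never becomes true.
   */
  static bool cgroup_fully_frozen(struct cgroup *cgrp)
  {
          return test_bit(CGRP_FREEZE, &cgrp->flags) &&
                 cgrp->freezer.nr_frozen_tasks == __cgroup_task_count(cgrp);
  }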

There are two paths that we can take:

1. Make kvm_vm_worker_thread() call into signal delivery path.
   io_wq_worker() is in a similar boat and handles signal delivery and can
   be frozen and trapped like regular threads.

2. Keep the count of threads which can't be frozen per cgroup so that cgroup
   freezer can ignore these threads.

#1 is better in that the cgroup will actually be frozen when reported
frozen. However, the rather ambiguous criterion we've been using for the
cgroup freezer is whether the cgroup can be safely snapshotted while
frozen, and as long as the workers not being frozen doesn't break that, we
can go for #2 too.

What do you guys think?

Thanks.

-- 
tejun


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-29  0:07 cgroup2 freezer and kvm_vm_worker_thread() Tejun Heo
@ 2024-10-29 20:46 ` Roman Gushchin
  2024-10-29 22:59 ` Paolo Bonzini
  2024-11-07 18:05 ` Michal Koutný
  2 siblings, 0 replies; 12+ messages in thread
From: Roman Gushchin @ 2024-10-29 20:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Bonzini, Luca Boccassi, kvm, cgroups, Michal Koutný,
	linux-kernel

On Mon, Oct 28, 2024 at 02:07:36PM -1000, Tejun Heo wrote:
> Hello,
> 
> Luca is reporting that cgroups which have kvm instances inside never
> complete freezing. This can be trivially reproduced:
> 
>   root@test ~# mkdir /sys/fs/cgroup/test
>   root@test ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
>   root@test ~# qemu-system-x86_64 --nographic -enable-kvm
> 
> and in another terminal:
> 
>   root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
>   root@test ~# cat /sys/fs/cgroup/test/cgroup.events
>   populated 1
>   frozen 0
>   root@test ~# for i in (cat /sys/fs/cgroup/test/cgroup.threads); echo $i; cat /proc/$i/stack; end 
>   2070
>   [<0>] do_freezer_trap+0x42/0x70
>   [<0>] get_signal+0x4da/0x870
>   [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>   [<0>] syscall_exit_to_user_mode+0x73/0x120
>   [<0>] do_syscall_64+0x87/0x140
>   [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   2159
>   [<0>] do_freezer_trap+0x42/0x70
>   [<0>] get_signal+0x4da/0x870
>   [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>   [<0>] syscall_exit_to_user_mode+0x73/0x120
>   [<0>] do_syscall_64+0x87/0x140
>   [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   2160
>   [<0>] do_freezer_trap+0x42/0x70
>   [<0>] get_signal+0x4da/0x870
>   [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>   [<0>] syscall_exit_to_user_mode+0x73/0x120
>   [<0>] do_syscall_64+0x87/0x140
>   [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   2161
>   [<0>] kvm_nx_huge_page_recovery_worker+0xea/0x680
>   [<0>] kvm_vm_worker_thread+0x8f/0x2b0
>   [<0>] kthread+0xe8/0x110
>   [<0>] ret_from_fork+0x33/0x40
>   [<0>] ret_from_fork_asm+0x1a/0x30
>   2164
>   [<0>] do_freezer_trap+0x42/0x70
>   [<0>] get_signal+0x4da/0x870
>   [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>   [<0>] syscall_exit_to_user_mode+0x73/0x120
>   [<0>] do_syscall_64+0x87/0x140
>   [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The cgroup freezing happens in the signal delivery path but
> kvm_vm_worker_thread() threads never call into the signal delivery path
> while joining non-root cgroups, so they never get frozen. Because the cgroup
> freezer determines whether a given cgroup is frozen by comparing the number
> of frozen threads to the total number of threads in the cgroup, the cgroup
> never becomes frozen and users waiting for the state transition may hang
> indefinitely.
> 
> There are two paths that we can take:
> 
> 1. Make kvm_vm_worker_thread() call into signal delivery path.
>    io_wq_worker() is in a similar boat and handles signal delivery and can
>    be frozen and trapped like regular threads.
> 
> 2. Keep the count of threads which can't be frozen per cgroup so that cgroup
>    freezer can ignore these threads.
> 
> #1 is better in that the cgroup will actually be frozen when reported
> frozen. However, the rather ambiguous criterion we've been using for the
> cgroup freezer is whether the cgroup can be safely snapshotted while
> frozen, and as long as the workers not being frozen doesn't break that, we
> can go for #2 too.
> 
> What do you guys think?

The general assumption (which is broken here) is that kernel threads
belong to the root cgroup, but also that they are not safe to freeze, so
we're fine unless a user moves them to another cgroup, which is generally
not a good idea for many reasons.

However, in this case we have a kthread which we want to freeze and which
belongs to a non-root cgroup, so I think option #1 is preferable. Option #2
brings a notion of special non-freezable threads into a user-facing API.
I don't know if we really need that, so I'd avoid it.

Thanks!


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-29  0:07 cgroup2 freezer and kvm_vm_worker_thread() Tejun Heo
  2024-10-29 20:46 ` Roman Gushchin
@ 2024-10-29 22:59 ` Paolo Bonzini
  2024-10-30  0:25   ` Tejun Heo
  2024-11-07 18:05 ` Michal Koutný
  2 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2024-10-29 22:59 UTC (permalink / raw)
  To: Tejun Heo, Luca Boccassi, Roman Gushchin
  Cc: kvm, cgroups, Michal Koutný, linux-kernel

On 10/29/24 01:07, Tejun Heo wrote:
> Hello,
> 
> Luca is reporting that cgroups which have kvm instances inside never
> complete freezing. This can be trivially reproduced:
> 
>    root@test ~# mkdir /sys/fs/cgroup/test
>    root@test ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
>    root@test ~# qemu-system-x86_64 --nographic -enable-kvm
> 
> and in another terminal:
> 
>    root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
>    root@test ~# cat /sys/fs/cgroup/test/cgroup.events
>    populated 1
>    frozen 0
>    root@test ~# for i in (cat /sys/fs/cgroup/test/cgroup.threads); echo $i; cat /proc/$i/stack; end
>    2070
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    2159
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    2160
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    2161
>    [<0>] kvm_nx_huge_page_recovery_worker+0xea/0x680
>    [<0>] kvm_vm_worker_thread+0x8f/0x2b0
>    [<0>] kthread+0xe8/0x110
>    [<0>] ret_from_fork+0x33/0x40
>    [<0>] ret_from_fork_asm+0x1a/0x30
>    2164
>    [<0>] do_freezer_trap+0x42/0x70
>    [<0>] get_signal+0x4da/0x870
>    [<0>] arch_do_signal_or_restart+0x1a/0x1c0
>    [<0>] syscall_exit_to_user_mode+0x73/0x120
>    [<0>] do_syscall_64+0x87/0x140
>    [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The cgroup freezing happens in the signal delivery path but
> kvm_vm_worker_thread() threads never call into the signal delivery path
> while joining non-root cgroups, so they never get frozen. Because the cgroup
> freezer determines whether a given cgroup is frozen by comparing the number
> of frozen threads to the total number of threads in the cgroup, the cgroup
> never becomes frozen and users waiting for the state transition may hang
> indefinitely.
> 
> There are two paths that we can take:
> 
> 1. Make kvm_vm_worker_thread() call into signal delivery path.
>     io_wq_worker() is in a similar boat and handles signal delivery and can
>     be frozen and trapped like regular threads.

For the freezing part, would this be anything more than

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d16ce8174ed6..b7b6a1c1b6a4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -47,6 +47,7 @@
  #include <linux/kern_levels.h>
  #include <linux/kstrtox.h>
  #include <linux/kthread.h>
+#include <linux/freezer.h>
  #include <linux/wordpart.h>
  
  #include <asm/page.h>
@@ -7429,22 +7430,27 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
  static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
  {
  	u64 start_time;
-	long remaining_time;
+	u64 end_time;
+
+	set_freezable();
  
  	while (true) {
  		start_time = get_jiffies_64();
-		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
+		end_time = start_time + get_nx_huge_page_recovery_timeout(start_time);
  
-		set_current_state(TASK_INTERRUPTIBLE);
-		while (!kthread_should_stop() && remaining_time > 0) {
-			schedule_timeout(remaining_time);
-			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
+		for (;;) {
  			set_current_state(TASK_INTERRUPTIBLE);
+			if (kthread_freezable_should_stop(NULL))
+				break;
+			start_time = get_jiffies_64();
+			if ((s64)(end_time - start_time) <= 0)
+				break;
+			schedule_timeout(end_time - start_time);
  		}
  
  		set_current_state(TASK_RUNNING);
  
-		if (kthread_should_stop())
+		if (kthread_freezable_should_stop(NULL))
  			return 0;
  
  		kvm_recover_nx_huge_pages(kvm);

(untested beyond compilation).

I'm not sure if the KVM worker thread should process signals.  We want it
to take the CPU time it uses from the guest, but otherwise it's not running
on behalf of userspace in the way that io_wq_worker() is.

Paolo



* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-29 22:59 ` Paolo Bonzini
@ 2024-10-30  0:25   ` Tejun Heo
  2024-10-30  0:38     ` Luca Boccassi
  2024-10-30 12:05     ` Paolo Bonzini
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2024-10-30  0:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Luca Boccassi, Roman Gushchin, kvm, cgroups, Michal Koutný,
	linux-kernel

Hello, Paolo.

On Tue, Oct 29, 2024 at 11:59:43PM +0100, Paolo Bonzini wrote:
...
> For the freezing part, would this be anything more than
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d16ce8174ed6..b7b6a1c1b6a4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -47,6 +47,7 @@
>  #include <linux/kern_levels.h>
>  #include <linux/kstrtox.h>
>  #include <linux/kthread.h>
> +#include <linux/freezer.h>
>  #include <linux/wordpart.h>
>  #include <asm/page.h>
> @@ -7429,22 +7430,27 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
>  static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
>  {
>  	u64 start_time;
> +	u64 end_time;
> +
> +	set_freezable();
>  	while (true) {
>  		start_time = get_jiffies_64();
> +		end_time = start_time + get_nx_huge_page_recovery_timeout(start_time);
> +		for (;;) {
>  			set_current_state(TASK_INTERRUPTIBLE);
> +			if (kthread_freezable_should_stop(NULL))
> +				break;
> +			start_time = get_jiffies_64();
> +			if ((s64)(end_time - start_time) <= 0)
> +				break;
> +			schedule_timeout(end_time - start_time);
>  		}
>  		set_current_state(TASK_RUNNING);
> +		if (kthread_freezable_should_stop(NULL))

So, this is the approach of the PM and cgroup1 freezers, which trap tasks
in a pretty opaque state. That's problematic when only a part of the system
is frozen, which is why the cgroup2 freezer is implemented in the signal
delivery / trap path, so that frozen tasks behave similarly to e.g. SIGSTOP
stops.

>  			return 0;
>  		kvm_recover_nx_huge_pages(kvm);
> 
> (untested beyond compilation).
> 
> I'm not sure if the KVM worker thread should process signals.  We want it
> to take the CPU time it uses from the guest, but otherwise it's not running
> on behalf of userspace in the way that io_wq_worker() is.

I see, so io_wq_worker() handles signals only partially. It sets
PF_USER_WORKER, which causes fatal signals to be ignored, so the only
signals which take effect are STOP/CONT (and friends); those are handled
in do_signal_stop(), which is also where the cgroup2 freezer is implemented.
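
A minimal sketch of that pattern, simplified from io_uring/io-wq.c
(should_exit() and do_pending_work() are hypothetical stand-ins for the
real worker logic):

  static int worker_loop(void *data)
  {
          while (!should_exit(data)) {
                  schedule_timeout_interruptible(HZ);
                  if (signal_pending(current)) {
                          struct ksignal ksig;

                          /*
                           * With PF_USER_WORKER set, get_signal() services
                           * STOP/CONT and the cgroup2 freezer trap itself
                           * and returns false; it returns true only when
                           * the worker should exit, as fatal signals are
                           * not delivered.
                           */
                          if (get_signal(&ksig))
                                  break;
                  }
                  do_pending_work(data);
          }
          return 0;
  }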

Given that the kthreads are tied to user processes, I think it'd be better
to behave as similarly to user tasks as possible in this regard, if
userspace being able to stop/cont these kthreads is okay.

If that can't be done, we can add a mechanism to ignore these tasks when
determining when a cgroup is frozen provided that this doesn't affect the
snapshotting (ie. when taking a snapshot of frozen cgroup, activities from
the kthreads won't corrupt the snapshot).

Thanks.

-- 
tejun


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-30  0:25   ` Tejun Heo
@ 2024-10-30  0:38     ` Luca Boccassi
  2024-10-30 12:05     ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Luca Boccassi @ 2024-10-30  0:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Bonzini, Roman Gushchin, kvm, cgroups, Michal Koutný,
	linux-kernel

On Wed, 30 Oct 2024 at 00:25, Tejun Heo <tj@kernel.org> wrote:
> If that can't be done, we can add a mechanism to ignore these tasks when
> determining when a cgroup is frozen provided that this doesn't affect the
> snapshotting (ie. when taking a snapshot of frozen cgroup, activities from
> the kthreads won't corrupt the snapshot).

What do these kvm kthreads actually do?

So one of the use cases in systemd is to freeze the entire user
session before suspend, so that the encryption key for the home
directory (homed) can be removed from memory (until after resume,
after the user has typed the password in the screen lock again),
without the risk of having programs trying to access the now
inaccessible files in the home directory. So if qemu is frozen but the
kvm kthread is not, and the kthread somehow lets the VM or part of it
still run and use files that are now inaccessible (e.g. write to the disk
image stored in the homedir), that would be problematic for this use
case.


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-30  0:25   ` Tejun Heo
  2024-10-30  0:38     ` Luca Boccassi
@ 2024-10-30 12:05     ` Paolo Bonzini
  2024-10-30 18:14       ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2024-10-30 12:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Luca Boccassi, Roman Gushchin, kvm, cgroups, Michal Koutný,
	linux-kernel

On Wed, Oct 30, 2024 at 1:25 AM Tejun Heo <tj@kernel.org> wrote:
> > I'm not sure if the KVM worker thread should process signals.  We want it
> > to take the CPU time it uses from the guest, but otherwise it's not running
> > on behalf of userspace in the way that io_wq_worker() is.
>
> I see, so io_wq_worker() handles signals only partially. It sets
> PF_USER_WORKER, which causes fatal signals to be ignored, so the only
> signals which take effect are STOP/CONT (and friends); those are handled
> in do_signal_stop(), which is also where the cgroup2 freezer is implemented.

What about SIGKILL? That's the one that I don't want to have for KVM
workers, because they should only stop when the file descriptor is
closed.

(Replying to Luca: the kthreads are dropping some internal data
structures that KVM had to "de-optimize" to deal with processor bugs.
They allow the data structures to be rebuilt in the optimal way using
large pages).

> Given that the kthreads are tied to user processes, I think it'd be better
> to behave as similarly to user tasks as possible in this regard, if
> userspace being able to stop/cont these kthreads is okay.

Yes, I totally agree with you on that, I'm just not sure of the best
way to do it.

I will try keeping the kthread and adding allow_signal(SIGSTOP).  That
should allow me to process the SIGSTOP via get_signal().
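
An untested, illustrative sketch of that direction (the loop body is a
placeholder for the real worker callbacks):

  static int kvm_vm_worker_thread(void *context)
  {
          /* Opt the kthread in to stop signals. */
          allow_signal(SIGSTOP);

          while (!kthread_should_stop()) {
                  if (signal_pending(current)) {
                          struct ksignal ksig;

                          /*
                           * Parks in do_signal_stop() on SIGSTOP and in
                           * do_freezer_trap() while the cgroup is frozen.
                           */
                          get_signal(&ksig);
                  }
                  /* ... do the actual work, sleep until the next round ... */
          }
          return 0;
  }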

Paolo



* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-30 12:05     ` Paolo Bonzini
@ 2024-10-30 18:14       ` Tejun Heo
  2024-11-06 23:21         ` Luca Boccassi
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2024-10-30 18:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Luca Boccassi, Roman Gushchin, kvm, cgroups, Michal Koutný,
	linux-kernel

Hello,

On Wed, Oct 30, 2024 at 01:05:16PM +0100, Paolo Bonzini wrote:
> On Wed, Oct 30, 2024 at 1:25 AM Tejun Heo <tj@kernel.org> wrote:
> > > I'm not sure if the KVM worker thread should process signals.  We want it
> > > to take the CPU time it uses from the guest, but otherwise it's not running
> > > on behalf of userspace in the way that io_wq_worker() is.
> >
> > I see, so io_wq_worker() handles signals only partially. It sets
> > PF_USER_WORKER, which causes fatal signals to be ignored, so the only
> > signals which take effect are STOP/CONT (and friends); those are handled
> > in do_signal_stop(), which is also where the cgroup2 freezer is implemented.
> 
> What about SIGKILL? That's the one that I don't want to have for KVM
> workers, because they should only stop when the file descriptor is
> closed.

I don't think SIGKILL does anything for PF_USER_WORKER threads. Those are
all handled in the fatal: label in kernel/signal.c::get_signal() and the
function just returns for PF_USER_WORKER threads. I haven't used it myself
but looking at io_uring usage, it seems pretty straightforward.
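
The relevant logic in kernel/signal.c::get_signal(), paraphrased
(simplified; see the actual source for the exact code):

  fatal:
          /*
           * PF_USER_WORKER threads catch and handle fatal signals
           * themselves, so get_signal() just returns to the worker loop
           * instead of calling do_group_exit() on its behalf.
           */
          if (current->flags & PF_USER_WORKER)
                  goto out;

          /* Regular tasks: death signal, no core dump. */
          do_group_exit(ksig->info.si_signo);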

> (Replying to Luca: the kthreads are dropping some internal data
> structures that KVM had to "de-optimize" to deal with processor bugs.
> They allow the data structures to be rebuilt in the optimal way using
> large pages).
> 
> > Given that the kthreads are tied to user processes, I think it'd be better
> > to behave as similarly to user tasks as possible in this regard, if
> > userspace being able to stop/cont these kthreads is okay.
> 
> Yes, I totally agree with you on that, I'm just not sure of the best
> way to do it.
> 
> I will try keeping the kthread and adding allow_signal(SIGSTOP).  That
> should allow me to process the SIGSTOP via get_signal().

I *think* you can just copy what io_wq_worker() is doing.

Thanks.

-- 
tejun


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-30 18:14       ` Tejun Heo
@ 2024-11-06 23:21         ` Luca Boccassi
  2024-11-06 23:22           ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Boccassi @ 2024-11-06 23:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Bonzini, Roman Gushchin, kvm, cgroups, Michal Koutný,
	linux-kernel

On Wed, 30 Oct 2024 at 18:14, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 30, 2024 at 01:05:16PM +0100, Paolo Bonzini wrote:
> > On Wed, Oct 30, 2024 at 1:25 AM Tejun Heo <tj@kernel.org> wrote:
> > > > I'm not sure if the KVM worker thread should process signals.  We want it
> > > > to take the CPU time it uses from the guest, but otherwise it's not running
> > > > on behalf of userspace in the way that io_wq_worker() is.
> > >
> > > I see, so io_wq_worker() handles signals only partially. It sets
> > > PF_USER_WORKER, which causes fatal signals to be ignored, so the only
> > > signals which take effect are STOP/CONT (and friends); those are handled
> > > in do_signal_stop(), which is also where the cgroup2 freezer is implemented.
> >
> > What about SIGKILL? That's the one that I don't want to have for KVM
> > workers, because they should only stop when the file descriptor is
> > closed.
>
> I don't think SIGKILL does anything for PF_USER_WORKER threads. Those are
> all handled in the fatal: label in kernel/signal.c::get_signal() and the
> function just returns for PF_USER_WORKER threads. I haven't used it myself
> but looking at io_uring usage, it seems pretty straightforward.
>
> > (Replying to Luca: the kthreads are dropping some internal data
> > structures that KVM had to "de-optimize" to deal with processor bugs.
> > They allow the data structures to be rebuilt in the optimal way using
> > large pages).
> >
> > > Given that the kthreads are tied to user processes, I think it'd be better
> > > to behave as similarly to user tasks as possible in this regard, if
> > > userspace being able to stop/cont these kthreads is okay.
> >
> > Yes, I totally agree with you on that, I'm just not sure of the best
> > way to do it.
> >
> > I will try keeping the kthread and adding allow_signal(SIGSTOP).  That
> > should allow me to process the SIGSTOP via get_signal().
>
> I *think* you can just copy what io_wq_worker() is doing.
>
> Thanks.
>
> --
> tejun

Hi,

Any update on this? We keep getting reports of this issue, so it would
be great if there was a fix for 6.12. Thanks!


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-11-06 23:21         ` Luca Boccassi
@ 2024-11-06 23:22           ` Paolo Bonzini
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2024-11-06 23:22 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Tejun Heo, Roman Gushchin, kvm, cgroups, Michal Koutný,
	linux-kernel

On Thu, Nov 7, 2024 at 12:21 AM Luca Boccassi <bluca@debian.org> wrote:
>
> On Wed, 30 Oct 2024 at 18:14, Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 30, 2024 at 01:05:16PM +0100, Paolo Bonzini wrote:
> > > On Wed, Oct 30, 2024 at 1:25 AM Tejun Heo <tj@kernel.org> wrote:
> > > > > I'm not sure if the KVM worker thread should process signals.  We want it
> > > > > to take the CPU time it uses from the guest, but otherwise it's not running
> > > > > on behalf of userspace in the way that io_wq_worker() is.
> > > >
> > > > I see, so io_wq_worker() handles signals only partially. It sets
> > > > PF_USER_WORKER, which causes fatal signals to be ignored, so the only
> > > > signals which take effect are STOP/CONT (and friends); those are handled
> > > > in do_signal_stop(), which is also where the cgroup2 freezer is implemented.
> > >
> > > What about SIGKILL? That's the one that I don't want to have for KVM
> > > workers, because they should only stop when the file descriptor is
> > > closed.
> >
> > I don't think SIGKILL does anything for PF_USER_WORKER threads. Those are
> > all handled in the fatal: label in kernel/signal.c::get_signal() and the
> > function just returns for PF_USER_WORKER threads. I haven't used it myself
> > but looking at io_uring usage, it seems pretty straightforward.
> >
> > > (Replying to Luca: the kthreads are dropping some internal data
> > > structures that KVM had to "de-optimize" to deal with processor bugs.
> > > They allow the data structures to be rebuilt in the optimal way using
> > > large pages).
> > >
> > > > Given that the kthreads are tied to user processes, I think it'd be better
> > > > to behave as similarly to user tasks as possible in this regard, if
> > > > userspace being able to stop/cont these kthreads is okay.
> > >
> > > Yes, I totally agree with you on that, I'm just not sure of the best
> > > way to do it.
> > >
> > > I will try keeping the kthread and adding allow_signal(SIGSTOP).  That
> > > should allow me to process the SIGSTOP via get_signal().
> >
> > I *think* you can just copy what io_wq_worker() is doing.
>
> Any update on this? We keep getting reports of this issue, so it would
> be great if there was a fix for 6.12. Thanks!

No, did not have time yet.

Paolo



* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-10-29  0:07 cgroup2 freezer and kvm_vm_worker_thread() Tejun Heo
  2024-10-29 20:46 ` Roman Gushchin
  2024-10-29 22:59 ` Paolo Bonzini
@ 2024-11-07 18:05 ` Michal Koutný
  2024-11-07 18:54   ` Tejun Heo
  2 siblings, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2024-11-07 18:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Bonzini, Luca Boccassi, Roman Gushchin, kvm, cgroups,
	linux-kernel


On Mon, Oct 28, 2024 at 02:07:36PM GMT, Tejun Heo <tj@kernel.org> wrote:
> There are two paths that we can take:
> 
> 1. Make kvm_vm_worker_thread() call into signal delivery path.
>    io_wq_worker() is in a similar boat and handles signal delivery and can
>    be frozen and trapped like regular threads.
> 
> 2. Keep the count of threads which can't be frozen per cgroup so that cgroup
>    freezer can ignore these threads.
> 
> #1 is better in that the cgroup will actually be frozen when reported
> frozen. However, the rather ambiguous criterion we've been using for the
> cgroup freezer is whether the cgroup can be safely snapshotted while
> frozen, and as long as the workers not being frozen doesn't break that, we
> can go for #2 too.
> 
> What do you guys think?

I'd first ask why the kvm_vm_worker_thread needs to be in the KVM task's
cgroup (and why it copies the task's priority at creation time with no
later adjustments)?

If it can remain inside the root cgroup (like any other good kthread), its
job may even be chunked into periodic/deferred workqueue pieces, with no
kthread per KVM at all.
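
A minimal sketch of that alternative (nx_recovery_work, nx_recovery_fn and
recovery_period are hypothetical names, not the current KVM code; work
items run from unbound kernel context, so nothing has to join the VM's
cgroup):

  static void nx_recovery_fn(struct work_struct *work)
  {
          struct kvm *kvm = container_of(work, struct kvm,
                                         nx_recovery_work.work);

          kvm_recover_nx_huge_pages(kvm);
          /* Re-arm for the next recovery period. */
          schedule_delayed_work(&kvm->nx_recovery_work,
                                get_nx_huge_page_recovery_timeout(get_jiffies_64()));
  }

  /* At VM creation: */
  INIT_DELAYED_WORK(&kvm->nx_recovery_work, nx_recovery_fn);
  schedule_delayed_work(&kvm->nx_recovery_work, recovery_period);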

If there are resource control/charging concerns, I was thinking about
the approach of cloning from the KVM task and never returning to
userspace, which I see you already discussed with PF_USER_WORKER (based
on #1). All context would be regularly inherited and no migration would
be needed.

(I remember issues with the kvm_vm_worker_thread surviving lifespan of
KVM task and preventing removal of the cgroup. Not sure if that was only
a race or there's real need for doing some cleanups on an exited task.)

As for #2, I'm not sure there's a good criterion for what to ignore.
Here it could possibly be PF_KTHREAD or PF_NOFREEZE (I get that the latter
has a purpose for the system-wide (or v1) freezer). Generally, we can't
tell what the effect of a thread's liveliness is, so it seems better to
conservatively treat the cgroup as unfrozen.


HTH,
Michal


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-11-07 18:05 ` Michal Koutný
@ 2024-11-07 18:54   ` Tejun Heo
  2024-11-07 20:48     ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2024-11-07 18:54 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Paolo Bonzini, Luca Boccassi, Roman Gushchin, kvm, cgroups,
	linux-kernel

Hello,

On Thu, Nov 07, 2024 at 07:05:46PM +0100, Michal Koutný wrote:
...
> I'd first ask why the kvm_vm_worker_thread needs to be in the KVM task's
> cgroup (and why it copies the task's priority at creation time with no
> later adjustments)?
> 
> If it can remain inside the root cgroup (like any other good kthread), its
> job may even be chunked into periodic/deferred workqueue pieces, with no
> kthread per KVM at all.

That'd be better but I suppose kvm wants them in the same cgroup for a
reason.

> If there are resource control/charging concerns, I was thinking about
> the approach of cloning from the KVM task and never returning to
> userspace, which I see you already discussed with PF_USER_WORKER (based
> on #1). All context would be regularly inherited and no migration would
> be needed.
> 
> (I remember issues with the kvm_vm_worker_thread surviving lifespan of
> KVM task and preventing removal of the cgroup. Not sure if that was only
> a race or there's real need for doing some cleanups on an exited task.)
> 
> As for #2, I'm not sure there's a good criterion for what to ignore.
> Here it could possibly be PF_KTHREAD or PF_NOFREEZE (I get that the latter
> has a purpose for the system-wide (or v1) freezer). Generally, we can't
> tell what the effect of a thread's liveliness is, so it seems better to
> conservatively treat the cgroup as unfrozen.

It'd have to be a separate flag which explicitly says that the task is
exempt from cgroup2 freezer, so yeah, it'd be best if we can avoid this
option.

Thanks.

-- 
tejun


* Re: cgroup2 freezer and kvm_vm_worker_thread()
  2024-11-07 18:54   ` Tejun Heo
@ 2024-11-07 20:48     ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2024-11-07 20:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný, Paolo Bonzini, Luca Boccassi, Roman Gushchin,
	kvm, cgroups, linux-kernel

On Thu, Nov 07, 2024, Tejun Heo wrote:
> Hello,
> 
> On Thu, Nov 07, 2024 at 07:05:46PM +0100, Michal Koutný wrote:
> ...
> > I'd first ask why the kvm_vm_worker_thread needs to be in the KVM task's
> > cgroup (and why it copies the task's priority at creation time with no
> > later adjustments)?
> > 
> > If it can remain inside the root cgroup (like any other good kthread), its
> > job may even be chunked into periodic/deferred workqueue pieces, with no
> > kthread per KVM at all.
> 
> That'd be better but I suppose kvm wants them in the same cgroup for a
> reason.

Yes.  The one and only user of kvm_vm_create_worker_thread() does non-trivial
work on behalf of the VM, and so we want all of the CPU time consumed by that
work to be charged to the VM's container, e.g. so that a VM that is generating
a lot of work doesn't negatively impact other VMs on the same host (the amount
of work done is directly affected by how the guest is accessing its memory).
