Live Patching
* [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
@ 2025-01-22  8:51 Yafang Shao
  2025-01-22 12:50 ` Petr Mladek
  0 siblings, 1 reply; 10+ messages in thread
From: Yafang Shao @ 2025-01-22  8:51 UTC (permalink / raw)
  To: jpoimboe, jikos, mbenes, pmladek, joe.lawrence; +Cc: live-patching, Yafang Shao

I encountered a hard lockup while attempting to reproduce the panic issue
that occurred on our production servers [0]. The hard lockup manifests as
follows:

[15852778.150191] livepatch: klp_try_switch_task: grpc_executor:421106 is sleeping on function do_exit
[15852778.169471] livepatch: klp_try_switch_task: grpc_executor:421244 is sleeping on function do_exit
[15852778.188746] livepatch: klp_try_switch_task: grpc_executor:421457 is sleeping on function do_exit
[15852778.208021] livepatch: klp_try_switch_task: grpc_executor:422407 is sleeping on function do_exit
[15852778.227292] livepatch: klp_try_switch_task: grpc_executor:423184 is sleeping on function do_exit
[15852778.246576] livepatch: klp_try_switch_task: grpc_executor:423582 is sleeping on function do_exit
[15852778.265863] livepatch: klp_try_switch_task: grpc_executor:423738 is sleeping on function do_exit
[15852778.285149] livepatch: klp_try_switch_task: grpc_executor:423739 is sleeping on function do_exit
[15852778.304446] livepatch: klp_try_switch_task: grpc_executor:423833 is sleeping on function do_exit
[15852778.323738] livepatch: klp_try_switch_task: grpc_executor:423893 is sleeping on function do_exit
[15852778.343017] livepatch: klp_try_switch_task: grpc_executor:423894 is sleeping on function do_exit
[15852778.362292] livepatch: klp_try_switch_task: grpc_executor:423976 is sleeping on function do_exit
[15852778.381565] livepatch: klp_try_switch_task: grpc_executor:423977 is sleeping on function do_exit
[15852778.400847] livepatch: klp_try_switch_task: grpc_executor:424610 is sleeping on function do_exit
[15852778.412319] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
...
[15852778.412374] CPU: 15 PID: 1 Comm: systemd Kdump: loaded Tainted: G S      W  O  K    6.1.52-3
[15852778.412377] Hardware name: New H3C Technologies Co., Ltd. H3C UniServer R4950 G5/RS45M2C9S, BIOS 5.12 10/15/2021
[15852778.412378] RIP: 0010:queued_write_lock_slowpath+0x75/0x135
...
[15852778.412397] Call Trace:
[15852778.412398]  <NMI>
[15852778.412400]  ? show_regs.cold+0x1a/0x1f
[15852778.412403]  ? watchdog_overflow_callback.cold+0x1e/0x70
[15852778.412406]  ? __perf_event_overflow+0x102/0x1e0
[15852778.412409]  ? perf_event_overflow+0x19/0x20
[15852778.412411]  ? x86_pmu_handle_irq+0xf7/0x160
[15852778.412415]  ? flush_tlb_one_kernel+0xe/0x30
[15852778.412418]  ? __set_pte_vaddr+0x2d/0x40
[15852778.412421]  ? set_pte_vaddr_p4d+0x3d/0x50
[15852778.412423]  ? set_pte_vaddr+0x6d/0xa0
[15852778.412424]  ? __native_set_fixmap+0x28/0x40
[15852778.412426]  ? native_set_fixmap+0x54/0x60
[15852778.412428]  ? ghes_copy_tofrom_phys+0x75/0x120
[15852778.412431]  ? __ghes_peek_estatus.isra.0+0x4e/0xb0
[15852778.412434]  ? ghes_in_nmi_queue_one_entry.constprop.0+0x3d/0x240
[15852778.412437]  ? amd_pmu_handle_irq+0x48/0xc0
[15852778.412438]  ? perf_event_nmi_handler+0x2d/0x50
[15852778.412440]  ? nmi_handle+0x60/0x120
[15852778.412443]  ? default_do_nmi+0x45/0x120
[15852778.412446]  ? exc_nmi+0x118/0x150
[15852778.412447]  ? end_repeat_nmi+0x16/0x67
[15852778.412450]  ? copy_process+0xf01/0x19f0
[15852778.412452]  ? queued_write_lock_slowpath+0x75/0x135
[15852778.412455]  ? queued_write_lock_slowpath+0x75/0x135
[15852778.412457]  ? queued_write_lock_slowpath+0x75/0x135
[15852778.412459]  </NMI>
[15852778.412460]  <TASK>
[15852778.412461]  _raw_write_lock_irq+0x43/0x50
[15852778.412463]  copy_process+0xf01/0x19f0
[15852778.412466]  kernel_clone+0x9d/0x3e0
[15852778.412468]  ? autofs_dev_ioctl_requester+0x100/0x100
[15852778.412471]  __do_sys_clone+0x66/0x90
[15852778.412475]  __x64_sys_clone+0x25/0x30
[15852778.412477]  do_syscall_64+0x38/0x90
[15852778.412478]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[15852778.412481] RIP: 0033:0x7f426bb3b9c1
...

Notably, dynamic_debug is enabled to collect debug information when
applying a livepatch, resulting in a large amount of debug output.

The issue arises because klp_try_complete_transition() holds the tasklist_lock
while klp_try_switch_task() is run for every task, so any task attempting to
acquire the write lock must spin until the readers release it. This becomes
problematic in the copy_process() path, where the lock is taken with IRQs
disabled, leading to the hard lockup. To prevent this, we should check for
rwlock contention before proceeding.

Link: https://lore.kernel.org/live-patching/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com/ [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/livepatch/transition.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index ba069459c101..774017825bb4 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -467,9 +467,14 @@ void klp_try_complete_transition(void)
 	 * unless the patch includes changes to a very common function.
 	 */
 	read_lock(&tasklist_lock);
-	for_each_process_thread(g, task)
+	for_each_process_thread(g, task) {
 		if (!klp_try_switch_task(task))
 			complete = false;
+		if (rwlock_is_contended(&tasklist_lock) || need_resched()) {
+			complete = false;
+			break;
+		}
+	}
 	read_unlock(&tasklist_lock);
 
 	/*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-22  8:51 [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task() Yafang Shao
@ 2025-01-22 12:50 ` Petr Mladek
  2025-01-22 13:46   ` Yafang Shao
  2025-01-31 13:22   ` zhang warden
  0 siblings, 2 replies; 10+ messages in thread
From: Petr Mladek @ 2025-01-22 12:50 UTC (permalink / raw)
  To: Yafang Shao; +Cc: jpoimboe, jikos, mbenes, joe.lawrence, live-patching

On Wed 2025-01-22 16:51:46, Yafang Shao wrote:
> I encountered a hard lockup while attempting to reproduce the panic issue
> that occurred on our production servers [0]. The hard lockup manifests as
> follows:
> 
> [15852778.150191] livepatch: klp_try_switch_task: grpc_executor:421106 is sleeping on function do_exit
> [15852778.169471] livepatch: klp_try_switch_task: grpc_executor:421244 is sleeping on function do_exit
> [15852778.188746] livepatch: klp_try_switch_task: grpc_executor:421457 is sleeping on function do_exit
> [15852778.208021] livepatch: klp_try_switch_task: grpc_executor:422407 is sleeping on function do_exit
> [15852778.227292] livepatch: klp_try_switch_task: grpc_executor:423184 is sleeping on function do_exit
> [15852778.246576] livepatch: klp_try_switch_task: grpc_executor:423582 is sleeping on function do_exit
> [15852778.265863] livepatch: klp_try_switch_task: grpc_executor:423738 is sleeping on function do_exit
> [15852778.285149] livepatch: klp_try_switch_task: grpc_executor:423739 is sleeping on function do_exit
> [15852778.304446] livepatch: klp_try_switch_task: grpc_executor:423833 is sleeping on function do_exit
> [15852778.323738] livepatch: klp_try_switch_task: grpc_executor:423893 is sleeping on function do_exit
> [15852778.343017] livepatch: klp_try_switch_task: grpc_executor:423894 is sleeping on function do_exit
> [15852778.362292] livepatch: klp_try_switch_task: grpc_executor:423976 is sleeping on function do_exit
> [15852778.381565] livepatch: klp_try_switch_task: grpc_executor:423977 is sleeping on function do_exit
> [15852778.400847] livepatch: klp_try_switch_task: grpc_executor:424610 is sleeping on function do_exit
> [15852778.412319] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
> ...
> [15852778.412374] CPU: 15 PID: 1 Comm: systemd Kdump: loaded Tainted: G S      W  O  K    6.1.52-3
> [15852778.412377] Hardware name: New H3C Technologies Co., Ltd. H3C UniServer R4950 G5/RS45M2C9S, BIOS 5.12 10/15/2021
> [15852778.412378] RIP: 0010:queued_write_lock_slowpath+0x75/0x135
> ...
> [15852778.412397] Call Trace:
> [15852778.412398]  <NMI>
> [15852778.412400]  ? show_regs.cold+0x1a/0x1f
> [15852778.412403]  ? watchdog_overflow_callback.cold+0x1e/0x70
> [15852778.412406]  ? __perf_event_overflow+0x102/0x1e0
> [15852778.412409]  ? perf_event_overflow+0x19/0x20
> [15852778.412411]  ? x86_pmu_handle_irq+0xf7/0x160
> [15852778.412415]  ? flush_tlb_one_kernel+0xe/0x30
> [15852778.412418]  ? __set_pte_vaddr+0x2d/0x40
> [15852778.412421]  ? set_pte_vaddr_p4d+0x3d/0x50
> [15852778.412423]  ? set_pte_vaddr+0x6d/0xa0
> [15852778.412424]  ? __native_set_fixmap+0x28/0x40
> [15852778.412426]  ? native_set_fixmap+0x54/0x60
> [15852778.412428]  ? ghes_copy_tofrom_phys+0x75/0x120
> [15852778.412431]  ? __ghes_peek_estatus.isra.0+0x4e/0xb0
> [15852778.412434]  ? ghes_in_nmi_queue_one_entry.constprop.0+0x3d/0x240
> [15852778.412437]  ? amd_pmu_handle_irq+0x48/0xc0
> [15852778.412438]  ? perf_event_nmi_handler+0x2d/0x50
> [15852778.412440]  ? nmi_handle+0x60/0x120
> [15852778.412443]  ? default_do_nmi+0x45/0x120
> [15852778.412446]  ? exc_nmi+0x118/0x150
> [15852778.412447]  ? end_repeat_nmi+0x16/0x67
> [15852778.412450]  ? copy_process+0xf01/0x19f0
> [15852778.412452]  ? queued_write_lock_slowpath+0x75/0x135
> [15852778.412455]  ? queued_write_lock_slowpath+0x75/0x135
> [15852778.412457]  ? queued_write_lock_slowpath+0x75/0x135
> [15852778.412459]  </NMI>
> [15852778.412460]  <TASK>
> [15852778.412461]  _raw_write_lock_irq+0x43/0x50
> [15852778.412463]  copy_process+0xf01/0x19f0
> [15852778.412466]  kernel_clone+0x9d/0x3e0
> [15852778.412468]  ? autofs_dev_ioctl_requester+0x100/0x100
> [15852778.412471]  __do_sys_clone+0x66/0x90
> [15852778.412475]  __x64_sys_clone+0x25/0x30
> [15852778.412477]  do_syscall_64+0x38/0x90
> [15852778.412478]  entry_SYSCALL_64_after_hwframe+0x64/0xce
> [15852778.412481] RIP: 0033:0x7f426bb3b9c1
> ...
> 
> Notably, dynamic_debug is enabled to collect debug information when
> applying a livepatch, resulting in a large amount of debug output.
> 
> The issue arises because klp_try_switch_task() holds the tasklist_lock, and
> if another task attempts to acquire it, it must spin until it's available.
> This becomes problematic in the copy_process() path, where IRQs are
> disabled, leading to the hard lockup. To prevent this, we should implement
> a check for spinlock contention before proceeding.
> 
> Link: https://lore.kernel.org/live-patching/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com/ [0]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/livepatch/transition.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index ba069459c101..774017825bb4 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -467,9 +467,14 @@ void klp_try_complete_transition(void)
>  	 * unless the patch includes changes to a very common function.
>  	 */
>  	read_lock(&tasklist_lock);
> -	for_each_process_thread(g, task)
> +	for_each_process_thread(g, task) {
>  		if (!klp_try_switch_task(task))
>  			complete = false;
> +		if (rwlock_is_contended(&tasklist_lock) || need_resched()) {

Are you able to finish the livepatch transition with this patch?

> +			complete = false;
> +			break;
> +		}
> +	}
>  	read_unlock(&tasklist_lock);
>  
>  	/*

With this patch, any operation which takes the tasklist_lock might
break klp_try_complete_transition(). I am afraid that this might
block the transition for a long time on huge systems with some
specific loads.

And the problem is caused by a printk() added just for debugging.
I wonder if you even use a slow serial port.

You might try to use printk_deferred() instead. Also you might need
to disable interrupts around the read_lock()/read_unlock() to
make sure that the console handling will be deferred after
the tasklist_lock gets released.

Anyway, I am against this patch.

Best Regards,
Petr


* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-22 12:50 ` Petr Mladek
@ 2025-01-22 13:46   ` Yafang Shao
  2025-01-31 13:06     ` Miroslav Benes
  2025-01-31 13:22   ` zhang warden
  1 sibling, 1 reply; 10+ messages in thread
From: Yafang Shao @ 2025-01-22 13:46 UTC (permalink / raw)
  To: Petr Mladek; +Cc: jpoimboe, jikos, mbenes, joe.lawrence, live-patching

On Wed, Jan 22, 2025 at 8:50 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Wed 2025-01-22 16:51:46, Yafang Shao wrote:
> > I encountered a hard lockup while attempting to reproduce the panic issue
> > that occurred on our production servers [0]. The hard lockup manifests as
> > follows:
> >
> > [15852778.150191] livepatch: klp_try_switch_task: grpc_executor:421106 is sleeping on function do_exit
> > [15852778.169471] livepatch: klp_try_switch_task: grpc_executor:421244 is sleeping on function do_exit
> > [15852778.188746] livepatch: klp_try_switch_task: grpc_executor:421457 is sleeping on function do_exit
> > [15852778.208021] livepatch: klp_try_switch_task: grpc_executor:422407 is sleeping on function do_exit
> > [15852778.227292] livepatch: klp_try_switch_task: grpc_executor:423184 is sleeping on function do_exit
> > [15852778.246576] livepatch: klp_try_switch_task: grpc_executor:423582 is sleeping on function do_exit
> > [15852778.265863] livepatch: klp_try_switch_task: grpc_executor:423738 is sleeping on function do_exit
> > [15852778.285149] livepatch: klp_try_switch_task: grpc_executor:423739 is sleeping on function do_exit
> > [15852778.304446] livepatch: klp_try_switch_task: grpc_executor:423833 is sleeping on function do_exit
> > [15852778.323738] livepatch: klp_try_switch_task: grpc_executor:423893 is sleeping on function do_exit
> > [15852778.343017] livepatch: klp_try_switch_task: grpc_executor:423894 is sleeping on function do_exit
> > [15852778.362292] livepatch: klp_try_switch_task: grpc_executor:423976 is sleeping on function do_exit
> > [15852778.381565] livepatch: klp_try_switch_task: grpc_executor:423977 is sleeping on function do_exit
> > [15852778.400847] livepatch: klp_try_switch_task: grpc_executor:424610 is sleeping on function do_exit
> > [15852778.412319] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
> > ...
> > [15852778.412374] CPU: 15 PID: 1 Comm: systemd Kdump: loaded Tainted: G S      W  O  K    6.1.52-3
> > [15852778.412377] Hardware name: New H3C Technologies Co., Ltd. H3C UniServer R4950 G5/RS45M2C9S, BIOS 5.12 10/15/2021
> > [15852778.412378] RIP: 0010:queued_write_lock_slowpath+0x75/0x135
> > ...
> > [15852778.412397] Call Trace:
> > [15852778.412398]  <NMI>
> > [15852778.412400]  ? show_regs.cold+0x1a/0x1f
> > [15852778.412403]  ? watchdog_overflow_callback.cold+0x1e/0x70
> > [15852778.412406]  ? __perf_event_overflow+0x102/0x1e0
> > [15852778.412409]  ? perf_event_overflow+0x19/0x20
> > [15852778.412411]  ? x86_pmu_handle_irq+0xf7/0x160
> > [15852778.412415]  ? flush_tlb_one_kernel+0xe/0x30
> > [15852778.412418]  ? __set_pte_vaddr+0x2d/0x40
> > [15852778.412421]  ? set_pte_vaddr_p4d+0x3d/0x50
> > [15852778.412423]  ? set_pte_vaddr+0x6d/0xa0
> > [15852778.412424]  ? __native_set_fixmap+0x28/0x40
> > [15852778.412426]  ? native_set_fixmap+0x54/0x60
> > [15852778.412428]  ? ghes_copy_tofrom_phys+0x75/0x120
> > [15852778.412431]  ? __ghes_peek_estatus.isra.0+0x4e/0xb0
> > [15852778.412434]  ? ghes_in_nmi_queue_one_entry.constprop.0+0x3d/0x240
> > [15852778.412437]  ? amd_pmu_handle_irq+0x48/0xc0
> > [15852778.412438]  ? perf_event_nmi_handler+0x2d/0x50
> > [15852778.412440]  ? nmi_handle+0x60/0x120
> > [15852778.412443]  ? default_do_nmi+0x45/0x120
> > [15852778.412446]  ? exc_nmi+0x118/0x150
> > [15852778.412447]  ? end_repeat_nmi+0x16/0x67
> > [15852778.412450]  ? copy_process+0xf01/0x19f0
> > [15852778.412452]  ? queued_write_lock_slowpath+0x75/0x135
> > [15852778.412455]  ? queued_write_lock_slowpath+0x75/0x135
> > [15852778.412457]  ? queued_write_lock_slowpath+0x75/0x135
> > [15852778.412459]  </NMI>
> > [15852778.412460]  <TASK>
> > [15852778.412461]  _raw_write_lock_irq+0x43/0x50
> > [15852778.412463]  copy_process+0xf01/0x19f0
> > [15852778.412466]  kernel_clone+0x9d/0x3e0
> > [15852778.412468]  ? autofs_dev_ioctl_requester+0x100/0x100
> > [15852778.412471]  __do_sys_clone+0x66/0x90
> > [15852778.412475]  __x64_sys_clone+0x25/0x30
> > [15852778.412477]  do_syscall_64+0x38/0x90
> > [15852778.412478]  entry_SYSCALL_64_after_hwframe+0x64/0xce
> > [15852778.412481] RIP: 0033:0x7f426bb3b9c1
> > ...
> >
> > Notably, dynamic_debug is enabled to collect debug information when
> > applying a livepatch, resulting in a large amount of debug output.
> >
> > The issue arises because klp_try_switch_task() holds the tasklist_lock, and
> > if another task attempts to acquire it, it must spin until it's available.
> > This becomes problematic in the copy_process() path, where IRQs are
> > disabled, leading to the hard lockup. To prevent this, we should implement
> > a check for spinlock contention before proceeding.
> >
> > Link: https://lore.kernel.org/live-patching/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com/ [0]
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  kernel/livepatch/transition.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > index ba069459c101..774017825bb4 100644
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -467,9 +467,14 @@ void klp_try_complete_transition(void)
> >        * unless the patch includes changes to a very common function.
> >        */
> >       read_lock(&tasklist_lock);
> > -     for_each_process_thread(g, task)
> > +     for_each_process_thread(g, task) {
> >               if (!klp_try_switch_task(task))
> >                       complete = false;
> > +             if (rwlock_is_contended(&tasklist_lock) || need_resched()) {
>
> Are you able to finish the livepatch transition with this patch?

Yes, I've deployed this change to a few test servers, and the
livepatch transition is still functioning correctly on them. These
servers are running the same workload which triggered the panic.

>
> > +                     complete = false;
> > +                     break;
> > +             }
> > +     }
> >       read_unlock(&tasklist_lock);
> >
> >       /*
>
> With this patch, any operation which takes the tasklist_lock might
> break klp_try_complete_transition(). I am afraid that this might
> block the transition for a long time on huge systems with some
> specific loads.
>
> And the problem is caused by a printk() added just for debugging.
> I wonder if you even use a slow serial port.

No, we don't use a slow console.

The console is:

$ cat /proc/cmdline
... console=tty0 ...

The log starts with:
[15852771.495313] livepatch: 'livepatch_61_release6': starting
patching transition
[15852771.500341] livepatch: klp_try_switch_task: kcompactd0:664 is running
[15852771.516951] livepatch: klp_try_switch_task: java:2236329 is running
[15852771.522891] livepatch: klp_try_switch_task: python:338825 is
sleeping on function do_exit
[15852771.526292] livepatch: klp_try_switch_task:
jemalloc_bg_thd:338826 is sleeping on function do_exit
....

And ends with:

[15852778.343017] livepatch: klp_try_switch_task: grpc_executor:423894
is sleeping on function do_exit
[15852778.362292] livepatch: klp_try_switch_task: grpc_executor:423976
is sleeping on function do_exit
[15852778.381565] livepatch: klp_try_switch_task: grpc_executor:423977
is sleeping on function do_exit
[15852778.400847] livepatch: klp_try_switch_task: grpc_executor:424610
is sleeping on function do_exit
[15852778.412319] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15

$ cat log | grep do_exit | wc -l
1061

It seems that there are simply too many threads executing do_exit() at
the moment.

>
> You might try to use printk_deferred() instead. Also you might need
> to disable interrupts around the read_lock()/read_unlock() to
> make sure that the console handling will be deferred after
> the tasklist_lock gets released.
>
> Anyway, I am against this patch.

However, there is still a risk of triggering a hard lockup if a large
number of tasks are involved.


--
Regards
Yafang


* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-22 13:46   ` Yafang Shao
@ 2025-01-31 13:06     ` Miroslav Benes
  0 siblings, 0 replies; 10+ messages in thread
From: Miroslav Benes @ 2025-01-31 13:06 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Petr Mladek, jpoimboe, jikos, joe.lawrence, live-patching

Hi,

> $ cat log | grep do_exit | wc -l
> 1061
> 
> It seems that there are simply too many threads executing do_exit() at
> the moment.
> 
> >
> > You might try to use printk_deferred() instead. Also you might need
> > to disable interrupts around the read_lock()/read_unlock() to
> > make sure that the console handling will be deferred after
> > the tasklist_lock gets released.
> >
> > Anyway, I am against this patch.
> 
> However, there is still a risk of triggering a hard lockup if a large
> number of tasks are involved.

And as Petr said, it is very likely caused by pr_debug() in this setup. 
The proposed patch is not a fix and would only make things worse.

Regards,
Miroslav


* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-22 12:50 ` Petr Mladek
  2025-01-22 13:46   ` Yafang Shao
@ 2025-01-31 13:22   ` zhang warden
  2025-01-31 13:39     ` Petr Mladek
  1 sibling, 1 reply; 10+ messages in thread
From: zhang warden @ 2025-01-31 13:22 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Yafang Shao, Josh Poimboeuf, Jiri Kosina, mbenes, joe.lawrence,
	live-patching



> On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@suse.com> wrote:
> 
> With this patch, any operation which takes the tasklist_lock might
> break klp_try_complete_transition(). I am afraid that this might
> block the transition for a long time on huge systems with some
> specific loads.
> 
> And the problem is caused by a printk() added just for debugging.
> I wonder if you even use a slow serial port.
> 
> You might try to use printk_deferred() instead. Also you might need
> to disable interrupts around the read_lock()/read_unlock() to
> make sure that the console handling will be deferred after
> the tasklist_lock gets released.
> 
> Anyway, I am against this patch.
> 
> Best Regards,
> Petr

Hi, Petr.

I am unfamiliar with the function `rwlock_is_contended`, but it seems this function does not block and only checks the status of the rwlock.

If I understand it right, the problem would arise from the `break`, which stops the `for_each_process_thread` iteration, right?

Thanks.
Wardenjohn.



* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-31 13:22   ` zhang warden
@ 2025-01-31 13:39     ` Petr Mladek
  2025-02-01  2:04       ` zhang warden
  2025-02-05  8:39       ` Yafang Shao
  0 siblings, 2 replies; 10+ messages in thread
From: Petr Mladek @ 2025-01-31 13:39 UTC (permalink / raw)
  To: zhang warden
  Cc: Yafang Shao, Josh Poimboeuf, Jiri Kosina, mbenes, joe.lawrence,
	live-patching

On Fri 2025-01-31 21:22:13, zhang warden wrote:
> 
> 
> > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@suse.com> wrote:
> > 
> > With this patch, any operation which takes the tasklist_lock might
> > break klp_try_complete_transition(). I am afraid that this might
> > block the transition for a long time on huge systems with some
> > specific loads.
> > 
> > And the problem is caused by a printk() added just for debugging.
> > I wonder if you even use a slow serial port.
> > 
> > You might try to use printk_deferred() instead. Also you might need
> > to disable interrupts around the read_lock()/read_unlock() to
> > make sure that the console handling will be deferred after
> > the tasklist_lock gets released.
> > 
> > Anyway, I am against this patch.
> > 
> > Best Regards,
> > Petr
> 
> Hi, Petr.
> 
> I am unfamiliar with the function `rwlock_is_contended`, but it seems this function will not block and just only check the status of the rw_lock.
> 
> If I understand it right, the problem would raise from the `break` which will stop the process of `for_each_process_thread`, right?

You got it right. I am afraid that it might create a livelock
situation for the livepatch transition. I mean that the check
might almost always break on systems with thousands of processes
that are frequently created and exited. It always has
to start from the beginning.

Best Regards,
Petr


* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-31 13:39     ` Petr Mladek
@ 2025-02-01  2:04       ` zhang warden
  2025-02-05  8:39       ` Yafang Shao
  1 sibling, 0 replies; 10+ messages in thread
From: zhang warden @ 2025-02-01  2:04 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Yafang Shao, Josh Poimboeuf, Jiri Kosina, mbenes, joe.lawrence,
	live-patching



> On Jan 31, 2025, at 21:39, Petr Mladek <pmladek@suse.com> wrote:
> 
> You got it right. I am afraid that it might create a livelock
> situation for the livepatch transition. I mean that the check
> might almost always break on systems with thousands of processes
> and frequently created/exited processes. It always has
> to start from the beginning.
> 
> Best Regards,
> Petr
True. It could easily lead to a situation where some important process never finishes the transition to the target patch state, leaving the livepatch fix unable to take effect on that key process (which may happen).

If this happens, I am sure it would be a nightmare...

Best Regards,
Wardenjohn



* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-01-31 13:39     ` Petr Mladek
  2025-02-01  2:04       ` zhang warden
@ 2025-02-05  8:39       ` Yafang Shao
  2025-02-06 16:43         ` Petr Mladek
  1 sibling, 1 reply; 10+ messages in thread
From: Yafang Shao @ 2025-02-05  8:39 UTC (permalink / raw)
  To: Petr Mladek
  Cc: zhang warden, Josh Poimboeuf, Jiri Kosina, mbenes, joe.lawrence,
	live-patching

On Fri, Jan 31, 2025 at 9:39 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Fri 2025-01-31 21:22:13, zhang warden wrote:
> >
> >
> > > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > With this patch, any operation which takes the tasklist_lock might
> > > break klp_try_complete_transition(). I am afraid that this might
> > > block the transition for a long time on huge systems with some
> > > specific loads.
> > >
> > > And the problem is caused by a printk() added just for debugging.
> > > I wonder if you even use a slow serial port.
> > >
> > > You might try to use printk_deferred() instead. Also you might need
> > > to disable interrupts around the read_lock()/read_unlock() to
> > > make sure that the console handling will be deferred after
> > > the tasklist_lock gets released.
> > >
> > > Anyway, I am against this patch.
> > >
> > > Best Regards,
> > > Petr
> >
> > Hi, Petr.
> >
> > I am unfamiliar with the function `rwlock_is_contended`, but it seems this function will not block and just only check the status of the rw_lock.
> >
> > If I understand it right, the problem would raise from the `break` which will stop the process of `for_each_process_thread`, right?
>
> You got it right. I am afraid that it might create a livelock
> situation for the livepatch transition. I mean that the check
> might almost always break on systems with thousands of processes
> and frequently created/exited processes. It always has
> to start from the beginning.

It doesn’t start from the beginning, as tasks that have already
switched over will be skipped.

Since the task->patch_state is set before the task is added to the
task list and the child’s patch_state is inherited from the parent, I
believe we can remove the tasklist_lock and use RCU instead, as
follows:

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 30187b1d8275..1d022f983bbc 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -399,11 +399,11 @@ void klp_try_complete_transition(void)
         * Usually this will transition most (or all) of the tasks on a system
         * unless the patch includes changes to a very common function.
         */
-       read_lock(&tasklist_lock);
+       rcu_read_lock();
        for_each_process_thread(g, task)
                if (!klp_try_switch_task(task))
                        complete = false;
-       read_unlock(&tasklist_lock);
+       rcu_read_unlock();

        /*
         * Ditto for the idle "swapper" tasks.


--
Regards
Yafang


* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-02-05  8:39       ` Yafang Shao
@ 2025-02-06 16:43         ` Petr Mladek
  2025-02-07  2:16           ` Yafang Shao
  0 siblings, 1 reply; 10+ messages in thread
From: Petr Mladek @ 2025-02-06 16:43 UTC (permalink / raw)
  To: Yafang Shao
  Cc: zhang warden, Josh Poimboeuf, Jiri Kosina, mbenes, joe.lawrence,
	live-patching

On Wed 2025-02-05 16:39:11, Yafang Shao wrote:
> On Fri, Jan 31, 2025 at 9:39 PM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Fri 2025-01-31 21:22:13, zhang warden wrote:
> > >
> > >
> > > > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@suse.com> wrote:
> > > >
> > > > With this patch, any operation which takes the tasklist_lock might
> > > > break klp_try_complete_transition(). I am afraid that this might
> > > > block the transition for a long time on huge systems with some
> > > > specific loads.
> > > >
> > > > And the problem is caused by a printk() added just for debugging.
> > > > I wonder if you even use a slow serial port.
> > > >
> > > > You might try to use printk_deferred() instead. Also you might need
> > > > to disable interrupts around the read_lock()/read_unlock() to
> > > > make sure that the console handling will be deferred after
> > > > the tasklist_lock gets released.
> > > >
> > > > Anyway, I am against this patch.
> > > >
> > > > Best Regards,
> > > > Petr
> > >
> > > Hi, Petr.
> > >
> > > I am unfamiliar with the function `rwlock_is_contended`, but it seems this function will not block and just only check the status of the rw_lock.
> > >
> > > If I understand it right, the problem would raise from the `break` which will stop the process of `for_each_process_thread`, right?
> >
> > You got it right. I am afraid that it might create a livelock
> > situation for the livepatch transition. I mean that the check
> > might almost always break on systems with thousands of processes
> > and frequently created/exited processes. It always has
> > to start from the beginning.
> 
> It doesn’t start from the beginning, as tasks that have already
> switched over will be skipped.

To make it clear: the next klp_try_complete_transition() will start
from the beginning, but it should be faster because it will quickly
skip the already-migrated processes. Right?

It makes some sense. I agree that checking the stack is relatively
slow operation.

That said, beware that the full stack is checked only when the process
is in kernel code: a kthread, or a userspace process inside a syscall.
Other processes should be handled much faster. The ratio of these
processes depends on the type of load. And I could imagine that
even checking the TIF_PATCH_PENDING flag might take a long time when
there are thousands of processes.


OK, let's make a step from a theory back to the practice:

You say that this patch helped and worked fine with your
workload.

It might be the best approach after all. It looks easier than
the hybrid model. And it might be needed even with the hybrid
model.

If I get it correctly, the email
https://lore.kernel.org/all/CALOAHbBZc6ORGzXwBRwe+rD2=YGf1jub5TEr989_GpK54P2o1A@mail.gmail.com/
says that you saw the hardlockup even with a relatively simple
livepatch. It modified "only" about 15 functions.

My main concern is how to guarantee forward progress. I would like
to make sure that klp_try_complete_transition() will eventually
check all processes.

I would modify the check to something like:

	read_lock(&tasklist_lock);

	timeout = jiffies + HZ;
	proceed_pending_processes = 0;

	for_each_process_thread(g, task) {
		/* check if this task has already switched over */
		if (task->patch_state == klp_target_state)
			continue;

		proceed_pending_processes++;

		if (!klp_try_switch_task(task))
			complete = false;

		/*
		 * Prevent hardlockup by not blocking tasklist_lock for too long.
		 * But guarantee the forward progress by making sure at least
		 * some pending processes were checked.
		 */
		if (rwlock_is_contended(&tasklist_lock) &&
		    time_after(jiffies, timeout) &&
		    proceed_pending_processes > 100) {
			complete = false;
			break;
		}
	}

	read_unlock(&tasklist_lock);



> Since the task->patch_state is set before the task is added to the
> task list and the child’s patch_state is inherited from the parent, I
> believe we can remove the tasklist_lock and use RCU instead, as
> follows:
> 
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index 30187b1d8275..1d022f983bbc 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -399,11 +399,11 @@ void klp_try_complete_transition(void)
>          * Usually this will transition most (or all) of the tasks on a system
>          * unless the patch includes changes to a very common function.
>          */
> -       read_lock(&tasklist_lock);
> > +       rcu_read_lock();
>         for_each_process_thread(g, task)
>                 if (!klp_try_switch_task(task))
>                         complete = false;
> -       read_unlock(&tasklist_lock);
> > +       rcu_read_unlock();

IMHO, this does not guarantee that we checked all processes in the
cycle.

I mean:

We already have a problem that tasklist_lock is not enough to
serialize livepatches modifying do_exit(). It creates a race window
where the process might still be scheduled but is no longer visible
in the for_each_process_thread() cycle.

And using rcu_read_lock() will make the race window even bigger.
I mean:

  + with read_lock(&tasklist_lock) the race window is limited by

       + read_lock(&tasklist_lock) in klp_try_complete_transition()
       + write_lock_irq(&tasklist_lock) in the middle of do_exit()

  + with rcu_read_lock() the race window is unlimited

I mean that more processes might get removed from the list while
klp_try_complete_transition() is running, because they are not
serialized via the tasklist_lock. As a result, more processes might
be scheduled without being seen by the for_each_process_thread()
cycle.

Does it make sense?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task()
  2025-02-06 16:43         ` Petr Mladek
@ 2025-02-07  2:16           ` Yafang Shao
  0 siblings, 0 replies; 10+ messages in thread
From: Yafang Shao @ 2025-02-07  2:16 UTC (permalink / raw)
  To: Petr Mladek
  Cc: zhang warden, Josh Poimboeuf, Jiri Kosina, mbenes, joe.lawrence,
	live-patching

On Fri, Feb 7, 2025 at 12:43 AM Petr Mladek <pmladek@suse.com> wrote:
>
> On Wed 2025-02-05 16:39:11, Yafang Shao wrote:
> > On Fri, Jan 31, 2025 at 9:39 PM Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > On Fri 2025-01-31 21:22:13, zhang warden wrote:
> > > >
> > > >
> > > > > On Jan 22, 2025, at 20:50, Petr Mladek <pmladek@suse.com> wrote:
> > > > >
> > > > > With this patch, any operation which takes the tasklist_lock might
> > > > > break klp_try_complete_transition(). I am afraid that this might
> > > > > block the transition for a long time on huge systems with some
> > > > > specific loads.
> > > > >
> > > > > And the problem is caused by a printk() added just for debugging.
> > > > > I wonder if you even use a slow serial port.
> > > > >
> > > > > You might try to use printk_deferred() instead. Also you might need
> > > > > to disable interrupts around the read_lock()/read_unlock() to
> > > > > make sure that the console handling will be deferred after
> > > > > the tasklist_lock gets released.
> > > > >
> > > > > Anyway, I am against this patch.
> > > > >
> > > > > Best Regards,
> > > > > Petr
> > > >
> > > > Hi, Petr.
> > > >
> > > > I am unfamiliar with the function `rwlock_is_contended`, but it seems this function does not block and only checks the status of the rwlock.
> > > >
> > > > If I understand it right, the problem would arise from the `break`, which stops the `for_each_process_thread` iteration, right?
> > >
> > > You got it right. I am afraid that it might create a livelock
> > > situation for the livepatch transition. I mean that the check
> > > might almost always break on systems with thousands of processes
> > > and frequently created/exited processes. It always has
> > > to start from the beginning.
> >
> > It doesn’t start from the beginning, as tasks that have already
> > switched over will be skipped.
>
> To make it clear. The next klp_try_complete_transition() will start
> from the beginning but it should be faster because it will quickly
> skip already migrated processes. Right?

Exactly, that’s what I meant.

>
> It makes some sense. I agree that checking the stack is a relatively
> slow operation.
>
> That said, beware that the full stack is checked only when the process
> is in the kernel code: kthread or userspace process calling a syscall.
> Other processes should be handled much faster. The ratio of these
> processes depends on the type of the load. And I could imagine that
> even checking the TIF_PATCH_PENDING might take a long time when
> there are thousands of processes.
>
>
> OK, let's make a step from a theory back to the practice:
>
> You say that this patch helped and worked fine with your
> workload.
>
> It might be the best approach after all. It looks easier than
> the hybrid model. And it might be needed even with the hybrid
> model.
>
> If I get it correctly, the email
> https://lore.kernel.org/all/CALOAHbBZc6ORGzXwBRwe+rD2=YGf1jub5TEr989_GpK54P2o1A@mail.gmail.com/
> says that you saw the hardlockup even with a relatively simple
> livepatch. It modified "only" about 15 functions.

That's correct. We’ve only modified 15 functions so far.

>
> My main concern is how to guarantee forward progress. I would like
> to make sure that klp_try_complete_transition() will eventually
> check all processes.
>
> I would modify the check to something like:
>
>         read_lock(&tasklist_lock);
>
>         timeout = jiffies + HZ;
>         proceed_pending_processes = 0;
>
>         for_each_process_thread(g, task) {
>                 /* check if this task has already switched over */
>                 if (task->patch_state == klp_target_state)
>                         continue;
>
>                 proceed_pending_processes++;
>
>                 if (!klp_try_switch_task(task))
>                         complete = false;
>
>                 /*
>                  * Prevent hardlockup by not blocking tasklist_lock for too long.
>                  * But guarantee the forward progress by making sure at least
>                  * some pending processes were checked.
>                  */
>                 if (rwlock_is_contended(&tasklist_lock) &&
>                     time_after(jiffies, timeout) &&
>                     proceed_pending_processes > 100) {
>                         complete = false;
>                         break;
>                 }
>         }
>
>         read_unlock(&tasklist_lock);
>

Thanks for your suggestion. I'll deploy this change to our production
servers and verify its effectiveness.

>
>
> > Since the task->patch_state is set before the task is added to the
> > task list and the child’s patch_state is inherited from the parent, I
> > believe we can remove the tasklist_lock and use RCU instead, as
> > follows:
> >
> > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> > index 30187b1d8275..1d022f983bbc 100644
> > --- a/kernel/livepatch/transition.c
> > +++ b/kernel/livepatch/transition.c
> > @@ -399,11 +399,11 @@ void klp_try_complete_transition(void)
> >          * Usually this will transition most (or all) of the tasks on a system
> >          * unless the patch includes changes to a very common function.
> >          */
> > -       read_lock(&tasklist_lock);
> > +       rcu_read_lock();
> >         for_each_process_thread(g, task)
> >                 if (!klp_try_switch_task(task))
> >                         complete = false;
> > -       read_unlock(&tasklist_lock);
> > +       rcu_read_unlock();
>
> IMHO, this does not guarantee that we checked all processes in the
> cycle.
>
> I mean:
>
> We already have a problem that tasklist_lock is not enough to
> serialize livepatches modifying do_exit(). It creates a race window
> where the process might still be scheduled but is no longer visible
> in the for_each_process_thread() cycle.
>
> And using rcu_read_lock() will make the race window even bigger.
> I mean:
>
>   + with read_lock(&tasklist_lock) the race window is limited by
>
>        + read_lock(&tasklist_lock) in klp_try_complete_transition()
>        + write_lock_irq(&tasklist_lock) in the middle of do_exit()
>
>   + with rcu_read_lock() the race window is unlimited
>
> I mean that more processes might get removed from the list while
> klp_try_complete_transition() is running, because they are not
> serialized via the tasklist_lock. As a result, more processes might
> be scheduled without being seen by the for_each_process_thread()
> cycle.
>
> Does it make sense?

That makes sense. Thanks for the detailed explanation.

--
Regards
Yafang

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-02-07  2:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-22  8:51 [PATCH] livepatch: Avoid hard lockup caused by klp_try_switch_task() Yafang Shao
2025-01-22 12:50 ` Petr Mladek
2025-01-22 13:46   ` Yafang Shao
2025-01-31 13:06     ` Miroslav Benes
2025-01-31 13:22   ` zhang warden
2025-01-31 13:39     ` Petr Mladek
2025-02-01  2:04       ` zhang warden
2025-02-05  8:39       ` Yafang Shao
2025-02-06 16:43         ` Petr Mladek
2025-02-07  2:16           ` Yafang Shao
