* CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
@ 2024-09-27 12:40 Greg Kroah-Hartman
From: Greg Kroah-Hartman @ 2024-09-27 12:40 UTC
  To: linux-cve-announce; +Cc: Greg Kroah-Hartman

Description
===========

In the Linux kernel, the following vulnerability has been resolved:

workqueue: Improve scalability of workqueue watchdog touch

On a ~2000 CPU powerpc system, hard lockups have been observed in the
workqueue code when stop_machine runs (in this case due to CPU hotplug).
This is due to lots of CPUs spinning in multi_cpu_stop, calling
touch_nmi_watchdog() which ends up calling wq_watchdog_touch().
wq_watchdog_touch() writes to the global variable wq_watchdog_touched,
and that can find itself in the same cacheline as other important
workqueue data, which slows down operations to the point of lockups.
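
To illustrate the cacheline sharing described above, here is a minimal
userspace C sketch. It is not kernel code: the struct and field names are
hypothetical stand-ins for wq_watchdog_touched and worker_pool_idr, and
the 128-byte line size is an assumption (adjust CACHE_LINE for the CPU in
question). It only shows how two adjacent globals can land in one line
and how alignment would separate them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed cache line size in bytes; hypothetical, adjust per CPU. */
#define CACHE_LINE 128

/* Two globals laid out back to back, as a linker might place them. */
struct packed_globals {
	unsigned long touched;	/* stand-in for a constantly written timestamp */
	void *pool_idr;		/* stand-in for data read on every queue operation */
};

/* The same two fields, each forced onto its own cache line. */
struct padded_globals {
	_Alignas(CACHE_LINE) unsigned long touched;
	_Alignas(CACHE_LINE) void *pool_idr;
};

/* True if two byte offsets fall into the same cache line. */
static bool same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

In the packed layout every store to `touched` invalidates the line that
readers of `pool_idr` need, which is the contention pattern the
description attributes to the lockups.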

In the case of the following abridged trace, worker_pool_idr was in
the hot line, causing the lockups to always appear at idr_find.

  watchdog: CPU 1125 self-detected hard LOCKUP @ idr_find
  Call Trace:
  get_work_pool
  __queue_work
  call_timer_fn
  run_timer_softirq
  __do_softirq
  do_softirq_own_stack
  irq_exit
  timer_interrupt
  decrementer_common_virt
  * interrupt: 900 (timer) at multi_cpu_stop
  multi_cpu_stop
  cpu_stopper_thread
  smpboot_thread_fn
  kthread

Fix this by having wq_watchdog_touch() only write to the line if the
last time a touch was recorded exceeds 1/4 of the watchdog threshold.
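
The rate limiting described above can be sketched in userspace C as
follows. This is an illustration under stated assumptions, not the
kernel's actual code: `last_touch`, `shared_writes`, `ticks_after()` and
`watchdog_touch()` are hypothetical names, time is a plain tick counter
passed in by the caller, and `shared_writes` exists only to make the
effect observable.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared global timestamp, in ticks. */
static unsigned long last_touch;

/* Demo-only instrumentation: counts stores to the shared timestamp. */
static unsigned long shared_writes;

/* Wrap-safe "a is later than b", in the spirit of the kernel's time_after(). */
static bool ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Record a touch at time `now`. The shared timestamp is only stored
 * when the previously recorded touch is older than thresh / 4, so
 * CPUs that call this in a tight loop mostly just read the value.
 */
static void watchdog_touch(unsigned long now, unsigned long thresh)
{
	if (ticks_after(now, last_touch + thresh / 4)) {
		last_touch = now;
		shared_writes++;
	}
}
```

With a threshold of 3000 ticks, calling `watchdog_touch()` once per tick
for ticks 1 through 3000 stores to the shared variable three times (at
ticks 751, 1502 and 2253) instead of 3000 times.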

The Linux kernel CVE team has assigned CVE-2024-46839 to this issue.


Affected and fixed versions
===========================

	Fixed in 5.15.167 with commit 9d08fce64dd7
	Fixed in 6.1.110 with commit a2abd35e7dc5
	Fixed in 6.6.51 with commit 241bce1c757d
	Fixed in 6.10.10 with commit da5f374103a1
	Fixed in 6.11 with commit 98f887f820c9

Please see https://www.kernel.org for a full list of kernel versions
currently supported by the kernel community.

Unaffected versions might change over time as fixes are backported to
older supported kernel versions.  The official CVE entry at
	https://cve.org/CVERecord/?id=CVE-2024-46839
will be updated if fixes are backported; please check it for the most
up-to-date information about this issue.


Affected files
==============

The file(s) affected by this issue are:
	kernel/workqueue.c


Mitigation
==========

The Linux kernel CVE team recommends that you update to the latest
stable kernel version for this, and many other bugfixes.  Individual
changes are never tested alone, but rather are part of a larger kernel
release.  Cherry-picking individual commits is not recommended or
supported by the Linux kernel community at all.  If, however, updating to
the latest release is impossible, the individual changes to resolve this
issue can be found at these commits:
	https://git.kernel.org/stable/c/9d08fce64dd77f42e2361a4818dbc4b50f3c7dad
	https://git.kernel.org/stable/c/a2abd35e7dc55bf9ed01e2b3481fa78e086d3bf4
	https://git.kernel.org/stable/c/241bce1c757d0587721512296952e6bba69631ed
	https://git.kernel.org/stable/c/da5f374103a1e0881bbd35847dc57b04ac155eb0
	https://git.kernel.org/stable/c/98f887f820c993e05a12e8aa816c80b8661d4c87

* Re: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
From: Petr Mladek @ 2024-10-01  8:02 UTC
  To: cve, linux-kernel
  Cc: linux-cve-announce, Greg Kroah-Hartman, Srikar Dronamraju,
	Nicholas Piggin, Paul E. McKenney, Tejun Heo, Sasha Levin,
	Michal Hocko, Michal Koutný

On Fri 2024-09-27 14:40:07, Greg Kroah-Hartman wrote:
> Description
> ===========
> 
> In the Linux kernel, the following vulnerability has been resolved:
> 
> workqueue: Improve scalability of workqueue watchdog touch
> 
> On a ~2000 CPU powerpc system, hard lockups have been observed in the
> workqueue code when stop_machine runs (in this case due to CPU hotplug).

I believe that this does not qualify as a security vulnerability.
Any hotplug is a privileged operation.

Best Regards,
Petr

* Re: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
From: Greg Kroah-Hartman @ 2024-10-01  8:22 UTC
  To: Petr Mladek
  Cc: cve, linux-kernel, linux-cve-announce, Srikar Dronamraju,
	Nicholas Piggin, Paul E. McKenney, Tejun Heo, Sasha Levin,
	Michal Hocko, Michal Koutný

On Tue, Oct 01, 2024 at 10:02:02AM +0200, Petr Mladek wrote:
> On Fri 2024-09-27 14:40:07, Greg Kroah-Hartman wrote:
> > Description
> > ===========
> > 
> > In the Linux kernel, the following vulnerability has been resolved:
> > 
> > workqueue: Improve scalability of workqueue watchdog touch
> > 
> > On a ~2000 CPU powerpc system, hard lockups have been observed in the
> > workqueue code when stop_machine runs (in this case due to CPU hotplug).
> 
> I believe that this does not qualify as a security vulnerability.
> Any hotplug is a privileged operation.

Really?  I see that happen on many embedded systems all the time, they
add/remove CPUs while the device runs/sleeps constantly.

Now to be fair, right now an "embedded system" usually doesn't have 2000
cpus, but what's wrong with marking this real bugfix as a vulnerability
resolution?  If you don't run your system in a way that allows cpus to
be stopped unless an admin says so, it will not be relevant.

thanks,

greg k-h

* Re: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
From: Michal Hocko @ 2024-10-01  9:07 UTC
  To: Greg Kroah-Hartman
  Cc: Petr Mladek, cve, linux-kernel, linux-cve-announce,
	Srikar Dronamraju, Nicholas Piggin, Paul E. McKenney, Tejun Heo,
	Sasha Levin, Michal Koutný

On Tue 01-10-24 10:22:51, Greg KH wrote:
> On Tue, Oct 01, 2024 at 10:02:02AM +0200, Petr Mladek wrote:
> > On Fri 2024-09-27 14:40:07, Greg Kroah-Hartman wrote:
> > > Description
> > > ===========
> > > 
> > > In the Linux kernel, the following vulnerability has been resolved:
> > > 
> > > workqueue: Improve scalability of workqueue watchdog touch
> > > 
> > > On a ~2000 CPU powerpc system, hard lockups have been observed in the
> > > workqueue code when stop_machine runs (in this case due to CPU hotplug).
> > 
> > I believe that this does not qualify as a security vulnerability.
> > Any hotplug is a privileged operation.
> 
> Really?  I see that happen on many embedded systems all the time, they
> add/remove CPUs while the device runs/sleeps constantly.

This is a powerpc specific fix. Other architectures are not affected.
 
> Now to be fair, right now an "embedded system" usually doesn't have 2000
> cpus, but what's wrong with marking this real bugfix as a vulnerability
> resolution?

Yes, this is indeed a scalability fix for huge systems with a lot of
CPUs. Anybody owning those systems was simply not able to use memory
hotplug without seeing those hard lockup messages. The system is not
really locked up. The progress of the hotplug operation is just utterly
slow. Calling this a vulnerability is a stretch IMHO.

The only potential attack vector is to have a machine configured to panic
on hard lockups on those huge ppc systems and to allow an adversary to
hot-remove CPUs, which in itself seems like a very bad idea anyway because
availability of such a system is then effectively compromised.
-- 
Michal Hocko
SUSE Labs

* Re: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
From: Paul E. McKenney @ 2024-10-01 11:37 UTC
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, Petr Mladek, cve, linux-kernel,
	linux-cve-announce, Srikar Dronamraju, Nicholas Piggin, Tejun Heo,
	Sasha Levin, Michal Koutný

On Tue, Oct 01, 2024 at 11:07:49AM +0200, Michal Hocko wrote:
> On Tue 01-10-24 10:22:51, Greg KH wrote:
> > On Tue, Oct 01, 2024 at 10:02:02AM +0200, Petr Mladek wrote:
> > > On Fri 2024-09-27 14:40:07, Greg Kroah-Hartman wrote:
> > > > Description
> > > > ===========
> > > > 
> > > > In the Linux kernel, the following vulnerability has been resolved:
> > > > 
> > > > workqueue: Improve scalability of workqueue watchdog touch
> > > > 
> > > > On a ~2000 CPU powerpc system, hard lockups have been observed in the
> > > > workqueue code when stop_machine runs (in this case due to CPU hotplug).
> > > 
> > > I believe that this does not qualify as a security vulnerability.
> > > Any hotplug is a privileged operation.
> > 
> > Really?  I see that happen on many embedded systems all the time, they
> > add/remove CPUs while the device runs/sleeps constantly.
> 
> This is a powerpc specific fix. Other architectures are not affected.
>  
> > Now to be fair, right now an "embedded system" usually doesn't have 2000
> > cpus, but what's wrong with marking this real bugfix as a vulnerability
> > resolution?
> 
> Yes, this is indeed a scalability fix for huge systems with a lot of
> CPUs. Anybody owning those systems was simply not able to use memory
> hotplug without seeing those hard lockup messages. The system is not
> really locked up. The progress of the hotplug operation is just utterly
> slow. Calling this a vulnerability is a stretch IMHO.
> 
> The only potential attack vector is to have a machine configured to panic
> on hard lockups on those huge ppc systems and to allow an adversary to
> hot-remove CPUs, which in itself seems like a very bad idea anyway because
> availability of such a system is then effectively compromised.

If the attacker can do CPU hotplug, then an effective (though admittedly
non-CVE) attack is to simply offline all but one of the CPUs.  Whatever
that system was doing with its 2,000 CPUs, it is unlikely to be doing
with only one of them.

And taking Michal's point further, if the load rises high enough, you
might get various types of lockups, and the system might be configured
to panic.  For example, the load resulting from dumping 2000 CPUs worth of
workload onto a single CPU could easily starve RCU's grace-period kthread
for the 21 seconds required to result in an RCU CPU stall warning.  And if
the system has sysctl_panic_on_rcu_stall set, then the system will panic.

But this really should be considered to be expected behavior given
privileged abuse rather than a vulnerability, correct?

							Thanx, Paul

* Re: CVE-2024-46839: workqueue: Improve scalability of workqueue watchdog touch
From: Greg Kroah-Hartman @ 2024-10-01 13:53 UTC
  To: Michal Hocko
  Cc: Petr Mladek, cve, linux-kernel, linux-cve-announce,
	Srikar Dronamraju, Nicholas Piggin, Paul E. McKenney, Tejun Heo,
	Sasha Levin, Michal Koutný

On Tue, Oct 01, 2024 at 11:07:49AM +0200, Michal Hocko wrote:
> On Tue 01-10-24 10:22:51, Greg KH wrote:
> > On Tue, Oct 01, 2024 at 10:02:02AM +0200, Petr Mladek wrote:
> > > On Fri 2024-09-27 14:40:07, Greg Kroah-Hartman wrote:
> > > > Description
> > > > ===========
> > > > 
> > > > In the Linux kernel, the following vulnerability has been resolved:
> > > > 
> > > > workqueue: Improve scalability of workqueue watchdog touch
> > > > 
> > > > On a ~2000 CPU powerpc system, hard lockups have been observed in the
> > > > workqueue code when stop_machine runs (in this case due to CPU hotplug).
> > > 
> > > I believe that this does not qualify as a security vulnerability.
> > > Any hotplug is a privileged operation.
> > 
> > Really?  I see that happen on many embedded systems all the time, they
> > add/remove CPUs while the device runs/sleeps constantly.
> 
> This is a powerpc specific fix. Other architectures are not affected.
>  
> > Now to be fair, right now an "embedded system" usually doesn't have 2000
> > cpus, but what's wrong with marking this real bugfix as a vulnerability
> > resolution?
> 
> Yes, this is indeed a scalability fix for huge systems with a lot of
> CPUs. Anybody owning those systems was simply not able to use memory
> hotplug without seeing those hard lockup messages. The system is not
> really locked up. The progress of the hotplug operation is just utterly
> slow. Calling this a vulnerability is a stretch IMHO.
> 
> The only potential attack vector is to have a machine configured to panic
> on hard lockups on those huge ppc systems and to allow an adversary to
> hot-remove CPUs, which in itself seems like a very bad idea anyway because
> availability of such a system is then effectively compromised.

Ok, now rejected, thanks.

greg k-h
