stable.vger.kernel.org archive mirror
* [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound
@ 2023-11-06 23:14 Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 02/13] rcu: Dump memory object info if callback function is invalid Sasha Levin
                   ` (11 more replies)
  0 siblings, 12 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Shuai Xue, Peter Zijlstra, Ingo Molnar, Sasha Levin, mingo, acme,
	linux-perf-users

From: Shuai Xue <xueshuai@linux.alibaba.com>

[ Upstream commit 54aee5f15b83437f23b2b2469bcf21bdd9823916 ]

When running perf-record with a large AUX area, e.g. 4GB, it fails with:

    #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
    failed to mmap with 12 (Cannot allocate memory)

and it reveals a WARNING with __alloc_pages():

	------------[ cut here ]------------
	WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248
	Call trace:
	 __alloc_pages+0x1ec/0x248
	 __kmalloc_large_node+0xc0/0x1f8
	 __kmalloc_node+0x134/0x1e8
	 rb_alloc_aux+0xe0/0x298
	 perf_mmap+0x440/0x660
	 mmap_region+0x308/0x8a8
	 do_mmap+0x3c0/0x528
	 vm_mmap_pgoff+0xf4/0x1b8
	 ksys_mmap_pgoff+0x18c/0x218
	 __arm64_sys_mmap+0x38/0x58
	 invoke_syscall+0x50/0x128
	 el0_svc_common.constprop.0+0x58/0x188
	 do_el0_svc+0x34/0x50
	 el0_svc+0x34/0x108
	 el0t_64_sync_handler+0xb8/0xc0
	 el0t_64_sync+0x1a4/0x1a8

'rb->aux_pages', allocated by kcalloc(), is a pointer array used to
maintain the AUX trace pages. The memory allocated for this array is
physically (and virtually) contiguous, with an order in the range
0..MAX_ORDER. If the size of the pointer array exceeds the limit set by
MAX_ORDER, the WARNING above is triggered.

So bail out early with -ENOMEM if the requested AUX area is out of bounds,
e.g.:

    #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
    failed to mmap with 12 (Cannot allocate memory)
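
For illustration, a minimal userspace sketch of the arithmetic behind
the new check (the 4KB page size and MAX_ORDER value of 10, i.e. a 4MB
maximum buddy allocation, are assumptions matching a typical v6.6
configuration, not values taken from the patch):

    /* Sketch: why a 4GB AUX area trips the MAX_ORDER limit. */
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)
    #define MAX_ORDER  10

    /* simplified get_order(): smallest order whose block covers size */
    static int get_order(unsigned long size)
    {
            int order = 0;

            while ((PAGE_SIZE << order) < size)
                    order++;
            return order;
    }

    int main(void)
    {
            unsigned long aux_bytes = 4UL << 30;                  /* 4GB AUX area */
            unsigned long nr_pages  = aux_bytes / PAGE_SIZE;      /* 2^20 pages */
            unsigned long array_sz  = nr_pages * sizeof(void *);  /* 8MB array */

            /* an 8MB pointer array needs order 11, above the order-10 limit */
            printf("array: %lu bytes, order %d (limit %d)\n",
                   array_sz, get_order(array_sz), MAX_ORDER);
            return 0;
    }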

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/events/ring_buffer.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index fb1e180b5f0af..e8d82c2f07d0e 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -700,6 +700,12 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
 		watermark = 0;
 	}
 
+	/*
+	 * kcalloc_node() is unable to allocate buffer if the size is larger
+	 * than: PAGE_SIZE << MAX_ORDER; directly bail out in this case.
+	 */
+	if (get_order((unsigned long)nr_pages * sizeof(void *)) > MAX_ORDER)
+		return -ENOMEM;
 	rb->aux_pages = kcalloc_node(nr_pages, sizeof(void *), GFP_KERNEL,
 				     node);
 	if (!rb->aux_pages)
-- 
2.42.0



* [PATCH AUTOSEL 6.6 02/13] rcu: Dump memory object info if callback function is invalid
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 03/13] srcu: Fix srcu_struct node grpmask overflow on 64-bit systems Sasha Levin
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Zhen Lei, Paul E . McKenney, Frederic Weisbecker, Sasha Levin,
	quic_neeraju, joel, josh, boqun.feng, jiangshanlai, rcu

From: Zhen Lei <thunder.leizhen@huawei.com>

[ Upstream commit 2cbc482d325ee58001472c4359b311958c4efdd1 ]

When a structure containing an RCU callback rhp is (incorrectly) freed
and reallocated after rhp is passed to call_rcu(), it is not unusual for
rhp->func to be set to NULL. This defeats the debugging prints used by
__call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
which expect to identify the offending code using the identity of this
function.

And in kernels built without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
are even worse, as can be seen from this splat:

Unable to handle kernel NULL pointer dereference at virtual address 0
... ...
PC is at 0x0
LR is at rcu_do_batch+0x1c0/0x3b8
... ...
 (rcu_do_batch) from (rcu_core+0x1d4/0x284)
 (rcu_core) from (__do_softirq+0x24c/0x344)
 (__do_softirq) from (__irq_exit_rcu+0x64/0x108)
 (__irq_exit_rcu) from (irq_exit+0x8/0x10)
 (irq_exit) from (__handle_domain_irq+0x74/0x9c)
 (__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
 (gic_handle_irq) from (__irq_svc+0x5c/0x94)
 (__irq_svc) from (arch_cpu_idle+0x20/0x3c)
 (arch_cpu_idle) from (default_idle_call+0x4c/0x78)
 (default_idle_call) from (do_idle+0xf8/0x150)
 (do_idle) from (cpu_startup_entry+0x18/0x20)
 (cpu_startup_entry) from (0xc01530)

This commit therefore adds calls to mem_dump_obj(rhp) to output some
information, for example:

  slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256

This provides the rough size of the memory block and the offset of the
rcu_head structure, which provides at least a few clues to help locate
the problem. If the problem is reproducible, additional slab debugging
can be enabled, for example CONFIG_DEBUG_SLAB=y, which can provide
significantly more information.
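
As a rough userspace illustration of the failure mode being diagnosed
(the types and the dump_obj() stand-in are simplified placeholders for
the kernel's rcu_head and kmem_dump_obj()):

    /* Sketch of the hazard: an object containing an rcu_head is freed
     * and reallocated after call_rcu(), zeroing ->func. The helper
     * added here dumps object info before the (still fatal) invocation,
     * turning a bare "PC is at 0x0" crash into one with a usable clue.
     */
    #include <stdio.h>

    struct rcu_head {
            struct rcu_head *next;
            void (*func)(struct rcu_head *rhp);
    };

    /* stand-in for kmem_dump_obj(): report whatever we know */
    static void dump_obj(struct rcu_head *rhp)
    {
            printf("suspect rcu_head at %p\n", (void *)rhp);
    }

    int main(void)
    {
            struct rcu_head head = { 0 };   /* ->func == NULL, as after a
                                             * free + zeroed reallocation */

            if (!head.func) {       /* mirrors debug_rcu_head_callback() */
                    dump_obj(&head);/* kernel: kmem_dump_obj(rhp) */
                    return 1;       /* the kernel would still invoke
                                     * rhp->func() and crash; the dump
                                     * just adds a clue first */
            }
            head.func(&head);
            return 0;
    }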

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/rcu.h      | 7 +++++++
 kernel/rcu/srcutiny.c | 1 +
 kernel/rcu/srcutree.c | 1 +
 kernel/rcu/tasks.h    | 1 +
 kernel/rcu/tiny.c     | 1 +
 kernel/rcu/tree.c     | 1 +
 6 files changed, 12 insertions(+)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 98e13be411afd..d612731feea48 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -10,6 +10,7 @@
 #ifndef __LINUX_RCU_H
 #define __LINUX_RCU_H
 
+#include <linux/slab.h>
 #include <trace/events/rcu.h>
 
 /*
@@ -248,6 +249,12 @@ static inline void debug_rcu_head_unqueue(struct rcu_head *head)
 }
 #endif	/* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */
 
+static inline void debug_rcu_head_callback(struct rcu_head *rhp)
+{
+	if (unlikely(!rhp->func))
+		kmem_dump_obj(rhp);
+}
+
 extern int rcu_cpu_stall_suppress_at_boot;
 
 static inline bool rcu_stall_is_suppressed_at_boot(void)
diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 336af24e0fe35..c38e5933a5d69 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -138,6 +138,7 @@ void srcu_drive_gp(struct work_struct *wp)
 	while (lh) {
 		rhp = lh;
 		lh = lh->next;
+		debug_rcu_head_callback(rhp);
 		local_bh_disable();
 		rhp->func(rhp);
 		local_bh_enable();
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 20d7a238d675a..b3e9bc3b60f0b 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1708,6 +1708,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	rhp = rcu_cblist_dequeue(&ready_cbs);
 	for (; rhp != NULL; rhp = rcu_cblist_dequeue(&ready_cbs)) {
 		debug_rcu_head_unqueue(rhp);
+		debug_rcu_head_callback(rhp);
 		local_bh_disable();
 		rhp->func(rhp);
 		local_bh_enable();
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 8d65f7d576a34..7c845532a50ac 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -538,6 +538,7 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
 	len = rcl.len;
 	for (rhp = rcu_cblist_dequeue(&rcl); rhp; rhp = rcu_cblist_dequeue(&rcl)) {
+		debug_rcu_head_callback(rhp);
 		local_bh_disable();
 		rhp->func(rhp);
 		local_bh_enable();
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 42f7589e51e09..fec804b790803 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -97,6 +97,7 @@ static inline bool rcu_reclaim_tiny(struct rcu_head *head)
 
 	trace_rcu_invoke_callback("", head);
 	f = head->func;
+	debug_rcu_head_callback(head);
 	WRITE_ONCE(head->func, (rcu_callback_t)0L);
 	f(head);
 	rcu_lock_release(&rcu_callback_map);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cb1caefa8bd07..29e4acbafefda 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2135,6 +2135,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
 		trace_rcu_invoke_callback(rcu_state.name, rhp);
 
 		f = rhp->func;
+		debug_rcu_head_callback(rhp);
 		WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
 		f(rhp);
 
-- 
2.42.0



* [PATCH AUTOSEL 6.6 03/13] srcu: Fix srcu_struct node grpmask overflow on 64-bit systems
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 02/13] rcu: Dump memory object info if callback function is invalid Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 04/13] selftests/lkdtm: Disable CONFIG_UBSAN_TRAP in test config Sasha Levin
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Denis Arefev, Mathieu Desnoyers, Joel Fernandes, David Laight,
	Paul E . McKenney, Frederic Weisbecker, Sasha Levin, jiangshanlai,
	josh, rcu

From: Denis Arefev <arefev@swemel.ru>

[ Upstream commit d8d5b7bf6f2105883bbd91bbd4d5b67e4e3dff71 ]

The value of a bitwise expression 1 << (cpu - sdp->mynode->grplo)
is subject to overflow due to a failure to cast operands to a larger
data type before performing the bitwise operation.

The maximum result of this subtraction is defined by the RCU_FANOUT_LEAF
Kconfig option, which on 64-bit systems defaults to 16 (resulting in a
maximum shift of 15), but which can be set as high as 64 (resulting
in a maximum shift of 63).  A shift count of 31 can cause sign extension,
producing 0xffffffff80000000 instead of the desired 0x80000000.
A value of 32 or greater triggers undefined behavior per the C standard.

This bug has not been known to cause issues because almost all kernels
take the default CONFIG_RCU_FANOUT_LEAF=16.  Furthermore, as long as a
given compiler gives a deterministic non-zero result for 1<<N for N>=32,
the code correctly invokes all SRCU callbacks, albeit wasting CPU time
along the way.

This commit therefore substitutes the correct 1UL for the buggy 1.
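
To see the difference concretely, a minimal sketch assuming a 64-bit
long:

    #include <stdio.h>

    int main(void)
    {
            int shift = 31;                     /* cpu - grplo */
            unsigned long bad  = 1 << shift;    /* int shift: sign-extends
                                                 * (32+ would be undefined) */
            unsigned long good = 1UL << shift;  /* full-width shift */

            printf("1   << 31 = %#lx\n", bad);  /* 0xffffffff80000000 */
            printf("1UL << 31 = %#lx\n", good); /* 0x80000000 */
            return 0;
    }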

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Denis Arefev <arefev@swemel.ru>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/srcutree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index b3e9bc3b60f0b..a1fcb8566b2e3 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -223,7 +223,7 @@ static bool init_srcu_struct_nodes(struct srcu_struct *ssp, gfp_t gfp_flags)
 				snp->grplo = cpu;
 			snp->grphi = cpu;
 		}
-		sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
+		sdp->grpmask = 1UL << (cpu - sdp->mynode->grplo);
 	}
 	smp_store_release(&ssp->srcu_sup->srcu_size_state, SRCU_SIZE_WAIT_BARRIER);
 	return true;
@@ -833,7 +833,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
 	int cpu;
 
 	for (cpu = snp->grplo; cpu <= snp->grphi; cpu++) {
-		if (!(mask & (1 << (cpu - snp->grplo))))
+		if (!(mask & (1UL << (cpu - snp->grplo))))
 			continue;
 		srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, cpu), delay);
 	}
-- 
2.42.0



* [PATCH AUTOSEL 6.6 04/13] selftests/lkdtm: Disable CONFIG_UBSAN_TRAP in test config
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 02/13] rcu: Dump memory object info if callback function is invalid Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 03/13] srcu: Fix srcu_struct node grpmask overflow on 64-bit systems Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 05/13] binfmt_elf: Support segments with 0 filesz and misaligned starts Sasha Levin
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Ricardo Cañuelo, Kees Cook, Sasha Levin, shuah,
	linux-kselftest

From: Ricardo Cañuelo <ricardo.canuelo@collabora.com>

[ Upstream commit cf77bf698887c3b9ebed76dea492b07a3c2c7632 ]

The lkdtm selftest config fragment enables CONFIG_UBSAN_TRAP to make the
ARRAY_BOUNDS test kill the calling process when an out-of-bound access
is detected by UBSAN. However, after this [1] commit, UBSAN is triggered
under many new scenarios that weren't detected before, such as in struct
definitions with fixed-size trailing arrays used as flexible arrays. As
a result, CONFIG_UBSAN_TRAP=y has become too aggressive an option to
enable except in specific situations.
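
For illustration, a hypothetical example of the kind of pattern that
started tripping UBSAN_BOUNDS after [1]:

    #include <stdlib.h>

    /* fixed-size trailing array used as a flexible array member */
    struct legacy_msg {
            int  len;
            char data[1];   /* really variable-length in practice */
    };

    int main(void)
    {
            struct legacy_msg *msg = malloc(sizeof(*msg) + 16);

            if (!msg)
                    return 1;
            msg->len = 16;
            msg->data[5] = 'x';  /* the allocation is big enough, but the
                                  * declared bound is data[1], so tightened
                                  * UBSAN_BOUNDS reports this access; the
                                  * modern fix is a true flexible array
                                  * member: char data[]; */
            free(msg);
            return 0;
    }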

`make kselftest-merge` applies CONFIG_UBSAN_TRAP=y to the kernel config
for all selftests, which makes many of them fail because of system hangs
during boot.

This change removes the config option from the lkdtm kselftest and
configures the ARRAY_BOUNDS test to look for UBSAN reports rather than
relying on the calling process being killed.

[1] commit 2d47c6956ab3 ("ubsan: Tighten UBSAN_BOUNDS on GCC")

Signed-off-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230802063252.1917997-1-ricardo.canuelo@collabora.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 tools/testing/selftests/lkdtm/config    | 1 -
 tools/testing/selftests/lkdtm/tests.txt | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/lkdtm/config b/tools/testing/selftests/lkdtm/config
index 5d52f64dfb430..7afe05e8c4d79 100644
--- a/tools/testing/selftests/lkdtm/config
+++ b/tools/testing/selftests/lkdtm/config
@@ -9,7 +9,6 @@ CONFIG_INIT_ON_FREE_DEFAULT_ON=y
 CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
 CONFIG_UBSAN=y
 CONFIG_UBSAN_BOUNDS=y
-CONFIG_UBSAN_TRAP=y
 CONFIG_STACKPROTECTOR_STRONG=y
 CONFIG_SLUB_DEBUG=y
 CONFIG_SLUB_DEBUG_ON=y
diff --git a/tools/testing/selftests/lkdtm/tests.txt b/tools/testing/selftests/lkdtm/tests.txt
index 607b8d7e3ea34..2f3a1b96da6e3 100644
--- a/tools/testing/selftests/lkdtm/tests.txt
+++ b/tools/testing/selftests/lkdtm/tests.txt
@@ -7,7 +7,7 @@ EXCEPTION
 #EXHAUST_STACK Corrupts memory on failure
 #CORRUPT_STACK Crashes entire system on success
 #CORRUPT_STACK_STRONG Crashes entire system on success
-ARRAY_BOUNDS
+ARRAY_BOUNDS call trace:|UBSAN: array-index-out-of-bounds
 CORRUPT_LIST_ADD list_add corruption
 CORRUPT_LIST_DEL list_del corruption
 STACK_GUARD_PAGE_LEADING
-- 
2.42.0



* [PATCH AUTOSEL 6.6 05/13] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (2 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 04/13] selftests/lkdtm: Disable CONFIG_UBSAN_TRAP in test config Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-07  0:04   ` Kees Cook
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 06/13] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Eric W. Biederman, Sebastian Ott, Thomas Weißschuh,
	Pedro Falcato, Kees Cook, Sasha Levin, viro, brauner,
	linux-fsdevel, linux-mm

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit 585a018627b4d7ed37387211f667916840b5c5ea ]

Implement a helper elf_load() that wraps elf_map() and performs all
of the necessary work to ensure that when "memsz > filesz" the bytes
described by "memsz > filesz" are zeroed.

An outstanding issue is if the first segment has filesz 0, and has a
randomized location. But that is the same as today.

In this change I replaced an open coded padzero() that did not clear
all of the way to the end of the page, with padzero() that does.

I also stopped checking the return of padzero() as there is at least
one known case where testing for failure is the wrong thing to do.
It looks like binfmt_elf_fdpic may have the proper set of tests
for when error handling can be safely completed.

I found a couple of commits in the old history
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
that look very interesting in understanding this code.

commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")

Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
>  commit 39b56d902bf35241e7cba6cc30b828ed937175ad
>  Author: Pavel Machek <pavel@ucw.cz>
>  Date:   Wed Feb 9 22:40:30 2005 -0800
>
>     [PATCH] binfmt_elf: clearing bss may fail
>
>     So we discover that Borland's Kylix application builder emits weird elf
>     files which describe a non-writeable bss segment.
>
>     So remove the clear_user() check at the place where we zero out the bss.  I
>     don't _think_ there are any security implications here (plus we've never
>     checked that clear_user() return value, so whoops if it is a problem).
>
>     Signed-off-by: Pavel Machek <pavel@suse.cz>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>

It seems pretty clear that the binfmt_elf_fdpic approach of skipping
clear_user() for non-writable segments, otherwise calling clear_user(),
aka padzero(), and checking its return code, is the right thing to do.

I just skipped the error checking as that avoids breaking things.

And notably, it looks like Borland's Kylix died in 2005 so it might be
safe to just consider read-only segments with memsz > filesz an error.
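
As a rough sketch of the zeroing rule elf_load() enforces for a PT_LOAD
segment (the program header values below are hypothetical):

    /* One PT_LOAD segment with p_memsz > p_filesz:
     *
     *   p_vaddr        p_vaddr+p_filesz              p_vaddr+p_memsz
     *     |- mapped from file -|---- bss, must read as zeros ----|
     *
     * elf_load() zeroes the tail of the last file-backed page with
     * padzero() and maps the remaining whole pages anonymously with
     * vm_brk_flags().
     */
    #include <stdio.h>

    #define ELF_MIN_ALIGN     4096UL
    #define ELF_PAGEOFFSET(v) ((v) & (ELF_MIN_ALIGN - 1))

    int main(void)
    {
            /* hypothetical program header values */
            unsigned long p_vaddr = 0x601010, p_filesz = 0x100,
                          p_memsz = 0x5000;
            unsigned long map_addr = 0x601000;  /* page-aligned mapping */

            unsigned long zero_start = map_addr +
                    ELF_PAGEOFFSET(p_vaddr) + p_filesz;
            unsigned long zero_end = map_addr +
                    ELF_PAGEOFFSET(p_vaddr) + p_memsz;

            printf("zero [%#lx, %#lx): page tail via padzero(), "
                   "whole pages via vm_brk_flags()\n", zero_start, zero_end);
            return 0;
    }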

Reported-by: Sebastian Ott <sebott@redhat.com>
Reported-by: Thomas Weißschuh <linux@weissschuh.net>
Closes: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/87sf71f123.fsf@email.froward.int.ebiederm.org
Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20230929032435.2391507-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
 1 file changed, 48 insertions(+), 63 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d4914073..2a615f476e44e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
 
 #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
 
-static int set_brk(unsigned long start, unsigned long end, int prot)
-{
-	start = ELF_PAGEALIGN(start);
-	end = ELF_PAGEALIGN(end);
-	if (end > start) {
-		/*
-		 * Map the last of the bss segment.
-		 * If the header is requesting these pages to be
-		 * executable, honour that (ppc32 needs this).
-		 */
-		int error = vm_brk_flags(start, end - start,
-				prot & PROT_EXEC ? VM_EXEC : 0);
-		if (error)
-			return error;
-	}
-	current->mm->start_brk = current->mm->brk = end;
-	return 0;
-}
-
 /* We need to explicitly zero any fractional pages
    after the data section (i.e. bss).  This would
    contain the junk from the file that should not
@@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	return(map_addr);
 }
 
+static unsigned long elf_load(struct file *filep, unsigned long addr,
+		const struct elf_phdr *eppnt, int prot, int type,
+		unsigned long total_size)
+{
+	unsigned long zero_start, zero_end;
+	unsigned long map_addr;
+
+	if (eppnt->p_filesz) {
+		map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
+		if (BAD_ADDR(map_addr))
+			return map_addr;
+		if (eppnt->p_memsz > eppnt->p_filesz) {
+			zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+				eppnt->p_filesz;
+			zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+				eppnt->p_memsz;
+
+			/* Zero the end of the last mapped page */
+			padzero(zero_start);
+		}
+	} else {
+		map_addr = zero_start = ELF_PAGESTART(addr);
+		zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+			eppnt->p_memsz;
+	}
+	if (eppnt->p_memsz > eppnt->p_filesz) {
+		/*
+		 * Map the last of the segment.
+		 * If the header is requesting these pages to be
+		 * executable, honour that (ppc32 needs this).
+		 */
+		int error;
+
+		zero_start = ELF_PAGEALIGN(zero_start);
+		zero_end = ELF_PAGEALIGN(zero_end);
+
+		error = vm_brk_flags(zero_start, zero_end - zero_start,
+				     prot & PROT_EXEC ? VM_EXEC : 0);
+		if (error)
+			map_addr = error;
+	}
+	return map_addr;
+}
+
+
 static unsigned long total_mapping_size(const struct elf_phdr *phdr, int nr)
 {
 	elf_addr_t min_addr = -1;
@@ -829,7 +855,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
 	struct elf_phdr *elf_property_phdata = NULL;
 	unsigned long elf_bss, elf_brk;
-	int bss_prot = 0;
 	int retval, i;
 	unsigned long elf_entry;
 	unsigned long e_entry;
@@ -1040,33 +1065,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (elf_ppnt->p_type != PT_LOAD)
 			continue;
 
-		if (unlikely (elf_brk > elf_bss)) {
-			unsigned long nbyte;
-
-			/* There was a PT_LOAD segment with p_memsz > p_filesz
-			   before this one. Map anonymous pages, if needed,
-			   and clear the area.  */
-			retval = set_brk(elf_bss + load_bias,
-					 elf_brk + load_bias,
-					 bss_prot);
-			if (retval)
-				goto out_free_dentry;
-			nbyte = ELF_PAGEOFFSET(elf_bss);
-			if (nbyte) {
-				nbyte = ELF_MIN_ALIGN - nbyte;
-				if (nbyte > elf_brk - elf_bss)
-					nbyte = elf_brk - elf_bss;
-				if (clear_user((void __user *)elf_bss +
-							load_bias, nbyte)) {
-					/*
-					 * This bss-zeroing can fail if the ELF
-					 * file specifies odd protections. So
-					 * we don't check the return value
-					 */
-				}
-			}
-		}
-
 		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
 				     !!interpreter, false);
 
@@ -1162,7 +1160,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			}
 		}
 
-		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
+		error = elf_load(bprm->file, load_bias + vaddr, elf_ppnt,
 				elf_prot, elf_flags, total_size);
 		if (BAD_ADDR(error)) {
 			retval = IS_ERR_VALUE(error) ?
@@ -1217,10 +1215,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (end_data < k)
 			end_data = k;
 		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
-		if (k > elf_brk) {
-			bss_prot = elf_prot;
+		if (k > elf_brk)
 			elf_brk = k;
-		}
 	}
 
 	e_entry = elf_ex->e_entry + load_bias;
@@ -1232,18 +1228,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	start_data += load_bias;
 	end_data += load_bias;
 
-	/* Calling set_brk effectively mmaps the pages that we need
-	 * for the bss and break sections.  We must do this before
-	 * mapping in the interpreter, to make sure it doesn't wind
-	 * up getting placed where the bss needs to go.
-	 */
-	retval = set_brk(elf_bss, elf_brk, bss_prot);
-	if (retval)
-		goto out_free_dentry;
-	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
-		retval = -EFAULT; /* Nobody gets to see this, but.. */
-		goto out_free_dentry;
-	}
+	current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
 
 	if (interpreter) {
 		elf_entry = load_elf_interp(interp_elf_ex,
-- 
2.42.0



* [PATCH AUTOSEL 6.6 06/13] clocksource/drivers/timer-imx-gpt: Fix potential memory leak
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (3 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 05/13] binfmt_elf: Support segments with 0 filesz and misaligned starts Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 07/13] binfmt_misc: cleanup on filesystem umount Sasha Levin
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Jacky Bai, Peng Fan, Daniel Lezcano, Sasha Levin, tglx, shawnguo,
	linux-arm-kernel

From: Jacky Bai <ping.bai@nxp.com>

[ Upstream commit 8051a993ce222a5158bccc6ac22ace9253dd71cb ]

Fix Coverity issue CID 250382: Resource leak (RESOURCE_LEAK).
Add kfree() on the error return paths.
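
A minimal sketch of the goto-based cleanup idiom the fix applies (the
struct name is a hypothetical stand-in for the driver's timer state):

    /* hypothetical stand-in for the driver's per-timer state */
    struct imx_timer_like {
            void __iomem *base;
    };

    static int init_example(struct device_node *np)
    {
            struct imx_timer_like *t;
            int ret;

            t = kzalloc(sizeof(*t), GFP_KERNEL);
            if (!t)
                    return -ENOMEM;

            t->base = of_iomap(np, 0);
            if (!t->base) {
                    ret = -ENXIO;
                    goto err_kfree;  /* every later failure lands here */
            }

            /* on success 't' stays allocated as the timer's state */
            return 0;

    err_kfree:
            kfree(t);  /* the leak: this free was missing on early returns */
            return ret;
    }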

Signed-off-by: Jacky Bai <ping.bai@nxp.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231009083922.1942971-1-ping.bai@nxp.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/clocksource/timer-imx-gpt.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/clocksource/timer-imx-gpt.c b/drivers/clocksource/timer-imx-gpt.c
index 28ab4f1a7c713..6a878d227a13b 100644
--- a/drivers/clocksource/timer-imx-gpt.c
+++ b/drivers/clocksource/timer-imx-gpt.c
@@ -434,12 +434,16 @@ static int __init mxc_timer_init_dt(struct device_node *np,  enum imx_gpt_type t
 		return -ENOMEM;
 
 	imxtm->base = of_iomap(np, 0);
-	if (!imxtm->base)
-		return -ENXIO;
+	if (!imxtm->base) {
+		ret = -ENXIO;
+		goto err_kfree;
+	}
 
 	imxtm->irq = irq_of_parse_and_map(np, 0);
-	if (imxtm->irq <= 0)
-		return -EINVAL;
+	if (imxtm->irq <= 0) {
+		ret = -EINVAL;
+		goto err_kfree;
+	}
 
 	imxtm->clk_ipg = of_clk_get_by_name(np, "ipg");
 
@@ -452,11 +456,15 @@ static int __init mxc_timer_init_dt(struct device_node *np,  enum imx_gpt_type t
 
 	ret = _mxc_timer_init(imxtm);
 	if (ret)
-		return ret;
+		goto err_kfree;
 
 	initialized = 1;
 
 	return 0;
+
+err_kfree:
+	kfree(imxtm);
+	return ret;
 }
 
 static int __init imx1_timer_init_dt(struct device_node *np)
-- 
2.42.0



* [PATCH AUTOSEL 6.6 07/13] binfmt_misc: cleanup on filesystem umount
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (4 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 06/13] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-07  0:04   ` Kees Cook
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 08/13] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware Sasha Levin
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Christian Brauner, Sargun Dhillon, Serge Hallyn, Jann Horn,
	Henning Schild, Andrei Vagin, Al Viro, Laurent Vivier,
	linux-fsdevel, Christian Brauner, Kees Cook, Sasha Levin,
	linux-mm

From: Christian Brauner <christian.brauner@ubuntu.com>

[ Upstream commit 1c5976ef0f7ad76319df748ccb99a4c7ba2ba464 ]

Currently, registering a new binary type pins the binfmt_misc
filesystem. Specifically, this means that as long as there is at least
one binary type registered the binfmt_misc filesystem survives all
umounts, i.e. the superblock is not destroyed. Meaning that a umount
followed by another mount will end up with the same superblock and the
same binary type handlers. This is a behavior we tend to discourage for
any new filesystems (apart from a few special filesystems such as e.g.
configfs or debugfs). A umount operation without the filesystem being
pinned - by e.g. someone holding a file descriptor to an open file -
should usually result in the destruction of the superblock and all
associated resources. This makes introspection easier and leads to
clearly defined, simple and clean semantics. An administrator can rely
on the fact that a umount will guarantee a clean slate making it
possible to reinitialize a filesystem. Right now all binary types would
need to be explicitly deleted before that can happen.

This allows us to remove the heavy-handed calls to simple_pin_fs() and
simple_release_fs() when creating and deleting binary types. This in
turn allows us to replace the current brittle pinning mechanism abusing
dget() which has caused a range of bugs judging from prior fixes in [2]
and [3]. The additional dget() in load_misc_binary() pins the dentry but
only does so for the sake to prevent ->evict_inode() from freeing the
node when a user removes the binary type and kill_node() is run. Which
would mean ->interpreter and ->interp_file would be freed causing a UAF.

This isn't really nicely documented nor is it very clean because it
relies on simple_pin_fs() pinning the filesystem as long as at least one
binary type exists. Otherwise it would cause load_misc_binary() to hold
on to a dentry belonging to a superblock that has been shut down.
Replace that implicit pinning with a clean and simple per-node refcount
and get rid of the ugly dget() pinning. A similar mechanism exists for
e.g. binderfs (cf. [4]). All the cleanup work can now be done in
->evict_inode().
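
In pattern form, the lifetime rule this introduces looks roughly as
follows; the helper names are placeholders for the
get_binfmt_handler()/put_binfmt_handler() pair added below:

    /* Lifetime rule, condensed: the entries list owns one reference and
     * each load_misc_binary() in flight owns one more, so removing a
     * node via a write to the entry/status file can never free it while
     * a concurrent exec still uses it.
     */
    e = get_handler(bprm);        /* read_lock + refcount_inc */
    if (e) {
            run_interpreter(e);   /* safe even if the node is removed now */
            put_handler(e);       /* refcount_dec_and_test -> free on last put */
    }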

In a follow-up patch we will make it possible to use binfmt_misc in
sandboxes. We will use the cleaner semantics where a umount for the
filesystem will cause the superblock and all resources to be
deallocated. In preparation for this apply the same semantics to the
initial binfmt_misc mount. Note, that this is a user-visible change and
as such a uapi change but one that we can reasonably risk. We've
discussed this in earlier versions of this patchset (cf. [1]).

The main user and provider of binfmt_misc is systemd. Systemd provides
binfmt_misc via autofs since it is configurable as a kernel module and
is used by a few exotic packages and users. As such a binfmt_misc mount
is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
provided on demand. Other autofs on demand filesystems include EFI ESP
which systemd umounts if the mountpoint stays idle for a certain amount
of time. This doesn't apply to the binfmt_misc autofs mount which isn't
touched once it is mounted, meaning this change can't accidentally wipe
binary type handlers without someone having explicitly unmounted
binfmt_misc. After speaking to systemd folks they don't expect this
change to affect them.

In line with our general policy, if we see a regression for systemd or
other users with this change we will switch back to the old behavior for
the initial binfmt_misc mount and have binary types pin the filesystem
again. But while we touch this code let's take the chance and let's
improve on the status quo.

[1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
[2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
[3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
[4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")

Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Henning Schild <henning.schild@siemens.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Laurent Vivier <laurent@vivier.eu>
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version, when a specific binary type handler was
    removed through a write to the entry's file, or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file, all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from the entry list.
    While that works fine, I disliked that model after thinking about it
    for a bit, because it means that there was a window where someone
    had already removed one or all binary handlers but they could still
    be safely reached from load_misc_binary() when it had managed to
    take the read_lock() on the entries list while inode eviction was
    already happening. Again, that is perfectly benign, but it's cleaner
    to remove the binary handler from the list immediately, meaning that
    once the write to the entry's file or the binfmt_misc status file
    returns, the binary type cannot be executed anymore. That gives
    stronger guarantees to the user.
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/binfmt_misc.c | 216 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 168 insertions(+), 48 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index e0108d17b085c..cf5ed5cd4102d 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -60,12 +60,11 @@ typedef struct {
 	char *name;
 	struct dentry *dentry;
 	struct file *interp_file;
+	refcount_t users;		/* sync removal with load_misc_binary() */
 } Node;
 
 static DEFINE_RWLOCK(entries_lock);
 static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;
 
 /*
  * Max length of the register string.  Determined by:
@@ -82,19 +81,23 @@ static int entry_count;
  */
 #define MAX_REGISTER_LENGTH 1920
 
-/*
- * Check if we support the binfmt
- * if we do, return the node, else NULL
- * locking is done in load_misc_binary
+/**
+ * search_binfmt_handler - search for a binary handler for @bprm
+ * @misc: handle to binfmt_misc instance
+ * @bprm: binary for which we are looking for a handler
+ *
+ * Search for a binary type handler for @bprm in the list of registered binary
+ * type handlers.
+ *
+ * Return: binary type list entry on success, NULL on failure
  */
-static Node *check_file(struct linux_binprm *bprm)
+static Node *search_binfmt_handler(struct linux_binprm *bprm)
 {
 	char *p = strrchr(bprm->interp, '.');
-	struct list_head *l;
+	Node *e;
 
 	/* Walk all the registered handlers. */
-	list_for_each(l, &entries) {
-		Node *e = list_entry(l, Node, list);
+	list_for_each_entry(e, &entries, list) {
 		char *s;
 		int j;
 
@@ -123,9 +126,49 @@ static Node *check_file(struct linux_binprm *bprm)
 		if (j == e->size)
 			return e;
 	}
+
 	return NULL;
 }
 
+/**
+ * get_binfmt_handler - try to find a binary type handler
+ * @misc: handle to binfmt_misc instance
+ * @bprm: binary for which we are looking for a handler
+ *
+ * Try to find a binfmt handler for the binary type. If one is found take a
+ * reference to protect against removal via bm_{entry,status}_write().
+ *
+ * Return: binary type list entry on success, NULL on failure
+ */
+static Node *get_binfmt_handler(struct linux_binprm *bprm)
+{
+	Node *e;
+
+	read_lock(&entries_lock);
+	e = search_binfmt_handler(bprm);
+	if (e)
+		refcount_inc(&e->users);
+	read_unlock(&entries_lock);
+	return e;
+}
+
+/**
+ * put_binfmt_handler - put binary handler node
+ * @e: node to put
+ *
+ * Free node syncing with load_misc_binary() and defer final free to
+ * load_misc_binary() in case it is using the binary type handler we were
+ * requested to remove.
+ */
+static void put_binfmt_handler(Node *e)
+{
+	if (refcount_dec_and_test(&e->users)) {
+		if (e->flags & MISC_FMT_OPEN_FILE)
+			filp_close(e->interp_file, NULL);
+		kfree(e);
+	}
+}
+
 /*
  * the loader itself
  */
@@ -139,12 +182,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	if (!enabled)
 		return retval;
 
-	/* to keep locking time low, we copy the interpreter string */
-	read_lock(&entries_lock);
-	fmt = check_file(bprm);
-	if (fmt)
-		dget(fmt->dentry);
-	read_unlock(&entries_lock);
+	fmt = get_binfmt_handler(bprm);
 	if (!fmt)
 		return retval;
 
@@ -198,7 +236,16 @@ static int load_misc_binary(struct linux_binprm *bprm)
 
 	retval = 0;
 ret:
-	dput(fmt->dentry);
+
+	/*
+	 * If we actually put the node here all concurrent calls to
+	 * load_misc_binary() will have finished. We also know
+	 * that for the refcount to be zero ->evict_inode() must have removed
+	 * the node to be deleted from the list. All that is left for us is to
+	 * close and free.
+	 */
+	put_binfmt_handler(fmt);
+
 	return retval;
 }
 
@@ -552,30 +599,90 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
 	return inode;
 }
 
+/**
+ * bm_evict_inode - cleanup data associated with @inode
+ * @inode: inode to which the data is attached
+ *
+ * Cleanup the binary type handler data associated with @inode if a binary type
+ * entry is removed or the filesystem is unmounted and the super block is
+ * shutdown.
+ *
+ * If the ->evict call was not caused by a super block shutdown but by a write
+ * to remove the entry or all entries via bm_{entry,status}_write() the entry
+ * will have already been removed from the list. We keep the list_empty() check
+ * to make that explicit.
+*/
 static void bm_evict_inode(struct inode *inode)
 {
 	Node *e = inode->i_private;
 
-	if (e && e->flags & MISC_FMT_OPEN_FILE)
-		filp_close(e->interp_file, NULL);
-
 	clear_inode(inode);
-	kfree(e);
+
+	if (e) {
+		write_lock(&entries_lock);
+		if (!list_empty(&e->list))
+			list_del_init(&e->list);
+		write_unlock(&entries_lock);
+		put_binfmt_handler(e);
+	}
 }
 
-static void kill_node(Node *e)
+/**
+ * unlink_binfmt_dentry - remove the dentry for the binary type handler
+ * @dentry: dentry associated with the binary type handler
+ *
+ * Do the actual filesystem work to remove a dentry for a registered binary
+ * type handler. Since binfmt_misc only allows simple files to be created
+ * directly under the root dentry of the filesystem we ensure that we are
+ * indeed passed a dentry directly beneath the root dentry, that the inode
+ * associated with the root dentry is locked, and that it is a regular file we
+ * are asked to remove.
+ */
+static void unlink_binfmt_dentry(struct dentry *dentry)
 {
-	struct dentry *dentry;
+	struct dentry *parent = dentry->d_parent;
+	struct inode *inode, *parent_inode;
+
+	/* All entries are immediate descendants of the root dentry. */
+	if (WARN_ON_ONCE(dentry->d_sb->s_root != parent))
+		return;
 
+	/* We only expect to be called on regular files. */
+	inode = d_inode(dentry);
+	if (WARN_ON_ONCE(!S_ISREG(inode->i_mode)))
+		return;
+
+	/* The parent inode must be locked. */
+	parent_inode = d_inode(parent);
+	if (WARN_ON_ONCE(!inode_is_locked(parent_inode)))
+		return;
+
+	if (simple_positive(dentry)) {
+		dget(dentry);
+		simple_unlink(parent_inode, dentry);
+		d_delete(dentry);
+		dput(dentry);
+	}
+}
+
+/**
+ * remove_binfmt_handler - remove a binary type handler
+ * @misc: handle to binfmt_misc instance
+ * @e: binary type handler to remove
+ *
+ * Remove a binary type handler from the list of binary type handlers and
+ * remove its associated dentry. This is called from
+ * binfmt_{entry,status}_write(). In the future, we might want to think about
+ * adding a proper ->unlink() method to binfmt_misc instead of forcing caller's
+ * to use writes to files in order to delete binary type handlers. But it has
+ * worked for so long that it's not a pressing issue.
+ */
+static void remove_binfmt_handler(Node *e)
+{
 	write_lock(&entries_lock);
 	list_del_init(&e->list);
 	write_unlock(&entries_lock);
-
-	dentry = e->dentry;
-	drop_nlink(d_inode(dentry));
-	d_drop(dentry);
-	dput(dentry);
-	simple_release_fs(&bm_mnt, &entry_count);
+	unlink_binfmt_dentry(e->dentry);
 }
 
 /* /<entry> */
@@ -602,8 +709,8 @@ bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 				size_t count, loff_t *ppos)
 {
-	struct dentry *root;
-	Node *e = file_inode(file)->i_private;
+	struct inode *inode = file_inode(file);
+	Node *e = inode->i_private;
 	int res = parse_command(buffer, count);
 
 	switch (res) {
@@ -617,13 +724,22 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 		break;
 	case 3:
 		/* Delete this handler. */
-		root = file_inode(file)->i_sb->s_root;
-		inode_lock(d_inode(root));
+		inode = d_inode(inode->i_sb->s_root);
+		inode_lock(inode);
 
+		/*
+		 * In order to add new element or remove elements from the list
+		 * via bm_{entry,register,status}_write() inode_lock() on the
+		 * root inode must be held.
+		 * The lock is exclusive ensuring that the list can't be
+		 * modified. Only load_misc_binary() can access but does so
+		 * read-only. So we only need to take the write lock when we
+		 * actually remove the entry from the list.
+		 */
 		if (!list_empty(&e->list))
-			kill_node(e);
+			remove_binfmt_handler(e);
 
-		inode_unlock(d_inode(root));
+		inode_unlock(inode);
 		break;
 	default:
 		return res;
@@ -682,13 +798,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	if (!inode)
 		goto out2;
 
-	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
-	if (err) {
-		iput(inode);
-		inode = NULL;
-		goto out2;
-	}
-
+	refcount_set(&e->users, 1);
 	e->dentry = dget(dentry);
 	inode->i_private = e;
 	inode->i_fop = &bm_entry_operations;
@@ -732,7 +842,8 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		size_t count, loff_t *ppos)
 {
 	int res = parse_command(buffer, count);
-	struct dentry *root;
+	Node *e, *next;
+	struct inode *inode;
 
 	switch (res) {
 	case 1:
@@ -745,13 +856,22 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		break;
 	case 3:
 		/* Delete all handlers. */
-		root = file_inode(file)->i_sb->s_root;
-		inode_lock(d_inode(root));
+		inode = d_inode(file_inode(file)->i_sb->s_root);
+		inode_lock(inode);
 
-		while (!list_empty(&entries))
-			kill_node(list_first_entry(&entries, Node, list));
+		/*
+		 * In order to add new element or remove elements from the list
+		 * via bm_{entry,register,status}_write() inode_lock() on the
+		 * root inode must be held.
+		 * The lock is exclusive ensuring that the list can't be
+		 * modified. Only load_misc_binary() can access but does so
+		 * read-only. So we only need to take the write lock when we
+		 * actually remove the entry from the list.
+		 */
+		list_for_each_entry_safe(e, next, &entries, list)
+			remove_binfmt_handler(e);
 
-		inode_unlock(d_inode(root));
+		inode_unlock(inode);
 		break;
 	default:
 		return res;
-- 
2.42.0



* [PATCH AUTOSEL 6.6 08/13] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (5 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 07/13] binfmt_misc: cleanup on filesystem umount Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 09/13] srcu: Only accelerate on enqueue time Sasha Levin
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Ronald Wahl, Alexandre Belloni, Daniel Lezcano, Sasha Levin, tglx,
	nicolas.ferre, claudiu.beznea, linux-arm-kernel

From: Ronald Wahl <ronald.wahl@raritan.com>

[ Upstream commit 6d3bc4c02d59996d1d3180d8ed409a9d7d5900e0 ]

On SAM9 hardware two cascaded 16 bit timers are used to form a 32 bit
high resolution timer that is used as scheduler clock when the kernel
has been configured that way (CONFIG_ATMEL_CLOCKSOURCE_TCB).

The driver initially triggers a reset-to-zero of the two timers but this
reset is only performed on the next rising clock. For the first timer
this is ok - it will be in the next 60ns (16MHz clock). For the chained
second timer this will only happen after the first timer overflows, i.e.
after 2^16 clocks (~4ms with a 16MHz clock). So in other words the
scheduler clock resets to 0 after the first 2^16 clock cycles.

It looks like the scheduler does not like this and misbehaves over its
lifetime, e.g. some tasks are scheduled with a long delay. Why that is,
and whether there are additional requirements for this behaviour, has
not been analysed further.

There is a simple fix to reset the second timer as well when the first
timer is reset, and that is to set the ATMEL_TC_ASWTRG_SET bit in the
Channel Mode Register (CMR) of the first timer. This will also raise
the TIOA line (the clock input of the second timer) when a software
trigger or SYNC is issued.
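
Conceptually, the 32-bit clocksource is read as sketched below (not
driver code); without the fix, the high half only resets after the low
half's first overflow, so the combined value jumps back to 0 mid-run:

    /* two chained 16-bit counters forming one 32-bit scheduler clock */
    static unsigned int tc_read32(unsigned short hi, unsigned short lo)
    {
            return ((unsigned int)hi << 16) | lo;
    }

    /* Without ATMEL_TC_ASWTRG_SET the hi counter is only cleared on the
     * first TIOA edge, i.e. after the lo counter's first overflow
     * (~4ms at 16MHz), so tc_read32() resets to 0 once after boot.
     */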

Signed-off-by: Ronald Wahl <ronald.wahl@raritan.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231007161803.31342-1-rwahl@gmx.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/clocksource/timer-atmel-tcb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/clocksource/timer-atmel-tcb.c b/drivers/clocksource/timer-atmel-tcb.c
index 27af17c995900..2a90c92a9182a 100644
--- a/drivers/clocksource/timer-atmel-tcb.c
+++ b/drivers/clocksource/timer-atmel-tcb.c
@@ -315,6 +315,7 @@ static void __init tcb_setup_dual_chan(struct atmel_tc *tc, int mck_divisor_idx)
 	writel(mck_divisor_idx			/* likely divide-by-8 */
 			| ATMEL_TC_WAVE
 			| ATMEL_TC_WAVESEL_UP		/* free-run */
+			| ATMEL_TC_ASWTRG_SET		/* TIOA0 rises at software trigger */
 			| ATMEL_TC_ACPA_SET		/* TIOA0 rises at 0 */
 			| ATMEL_TC_ACPC_CLEAR,		/* (duty cycle 50%) */
 			tcaddr + ATMEL_TC_REG(0, CMR));
-- 
2.42.0



* [PATCH AUTOSEL 6.6 09/13] srcu: Only accelerate on enqueue time
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (6 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 08/13] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 10/13] smp,csd: Throw an error if a CSD lock is stuck for too long Sasha Levin
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Frederic Weisbecker, Yong He, Joel Fernandes, Neeraj upadhyay,
	Like Xu, Sasha Levin, jiangshanlai, paulmck, josh, rcu

From: Frederic Weisbecker <frederic@kernel.org>

[ Upstream commit 8a77f38bcd28d3c22ab7dd8eff3f299d43c00411 ]

Acceleration in SRCU happens at enqueue time for each new callback. This
operation is expected not to fail and therefore any similar attempt
from other places shouldn't find any remaining callbacks to accelerate.

Moreover accelerations performed beyond enqueue time are error prone
because rcu_seq_snap() then may return the snapshot for a new grace
period that is not going to be started.

Remove these dangerous and needless accelerations and instead introduce
assertions that report unaccelerated callbacks leaking beyond enqueue
time.

Co-developed-by: Yong He <alexyonghe@tencent.com>
Signed-off-by: Yong He <alexyonghe@tencent.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/rcu/srcutree.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index a1fcb8566b2e3..dbb5116bb0200 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -782,8 +782,7 @@ static void srcu_gp_start(struct srcu_struct *ssp)
 	spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-				       rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
+	WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
 	spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
 	WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
 	WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
@@ -1692,6 +1691,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	ssp = sdp->ssp;
 	rcu_cblist_init(&ready_cbs);
 	spin_lock_irq_rcu_node(sdp);
+	WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 	if (sdp->srcu_cblist_invoking ||
@@ -1721,8 +1721,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	 */
 	spin_lock_irq_rcu_node(sdp);
 	rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-				       rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
 	sdp->srcu_cblist_invoking = false;
 	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
 	spin_unlock_irq_rcu_node(sdp);
-- 
2.42.0



* [PATCH AUTOSEL 6.6 10/13] smp,csd: Throw an error if a CSD lock is stuck for too long
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (7 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 09/13] srcu: Only accelerate on enqueue time Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 11/13] cpu/hotplug: Don't offline the last non-isolated CPU Sasha Levin
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Rik van Riel, Paul E . McKenney, Imran Khan, Leonardo Bras,
	Peter Zijlstra, Valentin Schneider, Juergen Gross,
	Jonathan Corbet, Randy Dunlap, Sasha Levin, tj, catalin.marinas,
	rostedt, linux-doc

From: Rik van Riel <riel@surriel.com>

[ Upstream commit 94b3f0b5af2c7af69e3d6e0cdd9b0ea535f22186 ]

The CSD lock seems to get stuck in 2 "modes". When it gets stuck
temporarily, it usually gets released in a few seconds, and sometimes
up to one or two minutes.

If the CSD lock stays stuck for more than several minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.

In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong, and, with panic_on_oops
and a kdump kernel loaded, dump a whole bunch more information about what
might have gone wrong.  In addition, there is an smp.panic_on_ipistall
kernel boot parameter that by default retains the old behavior, but when
set enables the panic after the CSD lock has been stuck for more than
the specified number of milliseconds, as in 300,000 for five minutes.
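
Condensed from the hunk below, the resulting check looks like this; the
signed comparison guards against TSC skew between sockets making the
measured delta underflow:

    /* smp.panic_on_ipistall=300000 -> panic once a CSD lock has been
     * stuck for more than 300,000 ms (five minutes).
     */
    s64 delta = (s64)(ts2 - ts0);   /* may go negative under TSC skew */

    BUG_ON(panic_on_ipistall > 0 &&
           delta > (s64)panic_on_ipistall * NSEC_PER_MSEC);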

[ paulmck: Apply Imran Khan feedback. ]
[ paulmck: Apply Leonardo Bras feedback. ]

Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +++++++
 kernel/smp.c                                    | 13 ++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a1731a0f0ef3..41644336e3587 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5858,6 +5858,13 @@
 			This feature may be more efficiently disabled
 			using the csdlock_debug- kernel parameter.
 
+	smp.panic_on_ipistall= [KNL]
+			If a csd_lock_timeout extends for more than
+			the specified number of milliseconds, panic the
+			system.  By default, let CSD-lock acquisition
+			take as long as they take.  Specifying 300,000
+			for this value provides a 5-minute timeout.
+
 	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
 	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
 	smsc-ircc2.ircc_sir=	[HW] SIR base I/O port
diff --git a/kernel/smp.c b/kernel/smp.c
index 8455a53465af8..695eb13a276d2 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
 
 static ulong csd_lock_timeout = 5000;  /* CSD lock timeout in milliseconds. */
 module_param(csd_lock_timeout, ulong, 0444);
+static int panic_on_ipistall;  /* CSD panic timeout in milliseconds, 300000 for five minutes. */
+module_param(panic_on_ipistall, int, 0444);
 
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
 
@@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long since this CSD lock was stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC));
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
-- 
2.42.0



* [PATCH AUTOSEL 6.6 11/13] cpu/hotplug: Don't offline the last non-isolated CPU
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (8 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 10/13] smp,csd: Throw an error if a CSD lock is stuck for too long Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 12/13] workqueue: Provide one lock class key per work_on_cpu() callsite Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 13/13] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size Sasha Levin
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Ran Xiaokai, Thomas Gleixner, Sasha Levin, peterz

From: Ran Xiaokai <ran.xiaokai@zte.com.cn>

[ Upstream commit 38685e2a0476127db766f81b1c06019ddc4c9ffa ]

If a system has isolated CPUs via the "isolcpus=" command line parameter,
then an attempt to offline the last housekeeping CPU will result in a
WARN_ON() when rebuilding the scheduler domains and a subsequent panic due
to an unhandled empty CPU mask in partition_sched_domains_locked().

cpuset_hotplug_workfn()
  rebuild_sched_domains_locked()
    ndoms = generate_sched_domains(&doms, &attr);
      cpumask_and(doms[0], top_cpuset.effective_cpus, housekeeping_cpumask(HK_FLAG_DOMAIN));

This results in an empty CPU mask, which triggers the warning and then the
subsequent crash:

WARNING: CPU: 4 PID: 80 at kernel/sched/topology.c:2366 build_sched_domains+0x120c/0x1408
Call trace:
 build_sched_domains+0x120c/0x1408
 partition_sched_domains_locked+0x234/0x880
 rebuild_sched_domains_locked+0x37c/0x798
 rebuild_sched_domains+0x30/0x58
 cpuset_hotplug_workfn+0x2a8/0x930

Unable to handle kernel paging request at virtual address fffe80027ab37080
 partition_sched_domains_locked+0x318/0x880
 rebuild_sched_domains_locked+0x37c/0x798

Aside from the resulting crash, it does not make any sense to offline the
last housekeeping CPU.

Prevent this by masking out the non-housekeeping CPUs when selecting a
target CPU for initiating the CPU unplug operation via the work queue.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/202310171709530660462@zte.com.cn
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/cpu.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6de7c6bb74eee..94430ea8b4a54 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1515,11 +1515,14 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 	/*
 	 * Ensure that the control task does not run on the to be offlined
 	 * CPU to prevent a deadlock against cfs_b->period_timer.
+	 * Also keep at least one housekeeping CPU online to avoid generating
+	 * an empty sched_domain span.
 	 */
-	cpu = cpumask_any_but(cpu_online_mask, cpu);
-	if (cpu >= nr_cpu_ids)
-		return -EBUSY;
-	return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
+	for_each_cpu_and(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
+		if (cpu != work.cpu)
+			return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
+	}
+	return -EBUSY;
 }
 
 static int cpu_down(unsigned int cpu, enum cpuhp_state target)
-- 
2.42.0
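
The replacement loop above boils down to: pick any online housekeeping
CPU other than the one being unplugged, and fail with -EBUSY if none
exists. A small userspace sketch with plain bitmasks (a hypothetical
8-CPU box booted with isolcpus=1-7, not kernel code) demonstrates the
-EBUSY case:

	#include <stdio.h>

	int main(void)
	{
		/* Bit i set = CPU i is in the mask (8 CPUs for illustration). */
		unsigned int online       = 0xff;	/* CPUs 0-7 online          */
		unsigned int housekeeping = 0x01;	/* isolcpus=1-7, HK = CPU 0 */
		unsigned int target       = 0;		/* offline the last HK CPU  */

		for (unsigned int cpu = 0; cpu < 8; cpu++) {
			if (!(((online & housekeeping) >> cpu) & 1))
				continue;	/* not an online housekeeping CPU */
			if (cpu != target) {
				printf("control work runs on CPU %u\n", cpu);
				return 0;
			}
		}
		printf("-EBUSY: would offline the last housekeeping CPU\n");
		return 0;
	}

With housekeeping = 0x03 and the same target, CPU 1 would be selected
instead and the offline would proceed.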


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 6.6 12/13] workqueue: Provide one lock class key per work_on_cpu() callsite
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (9 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 11/13] cpu/hotplug: Don't offline the last non-isolated CPU Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 13/13] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size Sasha Levin
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Frederic Weisbecker, Paul E . McKenney, Tejun Heo, Sasha Levin

From: Frederic Weisbecker <frederic@kernel.org>

[ Upstream commit 265f3ed077036f053981f5eea0b5b43e7c5b39ff ]

All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result, the workqueue-related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:

	long A(void *arg)
	{
		mutex_lock(&mutex);
		mutex_unlock(&mutex);
	}

	long B(void *arg)
	{
	}

	void launchA(void)
	{
		work_on_cpu(0, A, NULL);
	}

	void launchB(void)
	{
		mutex_lock(&mutex);
		work_on_cpu(1, B, NULL);
		mutex_unlock(&mutex);
	}

launchA and launchB running concurrently have no chance to deadlock.
However, the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.

The following shows an existing example of such a spurious lockdep splat:

	 ======================================================
	 WARNING: possible circular locking dependency detected
	 6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
	 ------------------------------------------------------
	 kworker/0:1/9 is trying to acquire lock:
	 ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0

	 but task is already holding lock:
	 ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500

	 which lock already depends on the new lock.

	 the existing dependency chain (in reverse order) is:

	 -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
			__flush_work+0x83/0x4e0
			work_on_cpu+0x97/0xc0
			rcu_nocb_cpu_offload+0x62/0xb0
			rcu_nocb_toggle+0xd0/0x1d0
			kthread+0xe6/0x120
			ret_from_fork+0x2f/0x40
			ret_from_fork_asm+0x1b/0x30

	 -> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
			__mutex_lock+0x81/0xc80
			rcu_nocb_cpu_deoffload+0x38/0xb0
			rcu_nocb_toggle+0x144/0x1d0
			kthread+0xe6/0x120
			ret_from_fork+0x2f/0x40
			ret_from_fork_asm+0x1b/0x30

	 -> #0 (cpu_hotplug_lock){++++}-{0:0}:
			__lock_acquire+0x1538/0x2500
			lock_acquire+0xbf/0x2a0
			percpu_down_write+0x31/0x200
			_cpu_down+0x57/0x2b0
			__cpu_down_maps_locked+0x10/0x20
			work_for_cpu_fn+0x15/0x20
			process_scheduled_works+0x2a7/0x500
			worker_thread+0x173/0x330
			kthread+0xe6/0x120
			ret_from_fork+0x2f/0x40
			ret_from_fork_asm+0x1b/0x30

	 other info that might help us debug this:

	 Chain exists of:
	   cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)

	  Possible unsafe locking scenario:

			CPU0                    CPU1
			----                    ----
	   lock((work_completion)(&wfc.work));
									lock(rcu_state.barrier_mutex);
									lock((work_completion)(&wfc.work));
	   lock(cpu_hotplug_lock);

	  *** DEADLOCK ***

	 2 locks held by kworker/0:1/9:
	  #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
	  #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500

	 stack backtrace:
	 CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
	 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
	 Workqueue: events work_for_cpu_fn
	 Call Trace:
	 rcu-torture: rcu_torture_read_exit: Start of episode
	  <TASK>
	  dump_stack_lvl+0x4a/0x80
	  check_noncircular+0x132/0x150
	  __lock_acquire+0x1538/0x2500
	  lock_acquire+0xbf/0x2a0
	  ? _cpu_down+0x57/0x2b0
	  percpu_down_write+0x31/0x200
	  ? _cpu_down+0x57/0x2b0
	  _cpu_down+0x57/0x2b0
	  __cpu_down_maps_locked+0x10/0x20
	  work_for_cpu_fn+0x15/0x20
	  process_scheduled_works+0x2a7/0x500
	  worker_thread+0x173/0x330
	  ? __pfx_worker_thread+0x10/0x10
	  kthread+0xe6/0x120
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork+0x2f/0x40
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork_asm+0x1b/0x30
	  </TASK>

Fix this by providing one lock class key per work_on_cpu() caller.

Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/workqueue.h | 46 +++++++++++++++++++++++++++++++++------
 kernel/workqueue.c        | 20 ++++++++++-------
 2 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 1c1d06804d450..24b1e5070f4d4 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -274,18 +274,16 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
  * to generate better code.
  */
 #ifdef CONFIG_LOCKDEP
-#define __INIT_WORK(_work, _func, _onstack)				\
+#define __INIT_WORK_KEY(_work, _func, _onstack, _key)			\
 	do {								\
-		static struct lock_class_key __key;			\
-									\
 		__init_work((_work), _onstack);				\
 		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
-		lockdep_init_map(&(_work)->lockdep_map, "(work_completion)"#_work, &__key, 0); \
+		lockdep_init_map(&(_work)->lockdep_map, "(work_completion)"#_work, (_key), 0); \
 		INIT_LIST_HEAD(&(_work)->entry);			\
 		(_work)->func = (_func);				\
 	} while (0)
 #else
-#define __INIT_WORK(_work, _func, _onstack)				\
+#define __INIT_WORK_KEY(_work, _func, _onstack, _key)			\
 	do {								\
 		__init_work((_work), _onstack);				\
 		(_work)->data = (atomic_long_t) WORK_DATA_INIT();	\
@@ -294,12 +292,22 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
 	} while (0)
 #endif
 
+#define __INIT_WORK(_work, _func, _onstack)				\
+	do {								\
+		static __maybe_unused struct lock_class_key __key;	\
+									\
+		__INIT_WORK_KEY(_work, _func, _onstack, &__key);	\
+	} while (0)
+
 #define INIT_WORK(_work, _func)						\
 	__INIT_WORK((_work), (_func), 0)
 
 #define INIT_WORK_ONSTACK(_work, _func)					\
 	__INIT_WORK((_work), (_func), 1)
 
+#define INIT_WORK_ONSTACK_KEY(_work, _func, _key)			\
+	__INIT_WORK_KEY((_work), (_func), 1, _key)
+
 #define __INIT_DELAYED_WORK(_work, _func, _tflags)			\
 	do {								\
 		INIT_WORK(&(_work)->work, (_func));			\
@@ -693,8 +701,32 @@ static inline long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
 	return fn(arg);
 }
 #else
-long work_on_cpu(int cpu, long (*fn)(void *), void *arg);
-long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg);
+long work_on_cpu_key(int cpu, long (*fn)(void *),
+		     void *arg, struct lock_class_key *key);
+/*
+ * A new key is defined for each caller to make sure the work
+ * associated with the function doesn't share its locking class.
+ */
+#define work_on_cpu(_cpu, _fn, _arg)			\
+({							\
+	static struct lock_class_key __key;		\
+							\
+	work_on_cpu_key(_cpu, _fn, _arg, &__key);	\
+})
+
+long work_on_cpu_safe_key(int cpu, long (*fn)(void *),
+			  void *arg, struct lock_class_key *key);
+
+/*
+ * A new key is defined for each caller to make sure the work
+ * associated with the function doesn't share its locking class.
+ */
+#define work_on_cpu_safe(_cpu, _fn, _arg)		\
+({							\
+	static struct lock_class_key __key;		\
+							\
+	work_on_cpu_safe_key(_cpu, _fn, _arg, &__key);	\
+})
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_FREEZER
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a3522b70218d3..0f682da96e1c5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5622,50 +5622,54 @@ static void work_for_cpu_fn(struct work_struct *work)
 }
 
 /**
- * work_on_cpu - run a function in thread context on a particular cpu
+ * work_on_cpu_key - run a function in thread context on a particular cpu
  * @cpu: the cpu to run on
  * @fn: the function to run
  * @arg: the function arg
+ * @key: The lock class key for lock debugging purposes
  *
  * It is up to the caller to ensure that the cpu doesn't go offline.
  * The caller must not hold any locks which would prevent @fn from completing.
  *
  * Return: The value @fn returns.
  */
-long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
+long work_on_cpu_key(int cpu, long (*fn)(void *),
+		     void *arg, struct lock_class_key *key)
 {
 	struct work_for_cpu wfc = { .fn = fn, .arg = arg };
 
-	INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
+	INIT_WORK_ONSTACK_KEY(&wfc.work, work_for_cpu_fn, key);
 	schedule_work_on(cpu, &wfc.work);
 	flush_work(&wfc.work);
 	destroy_work_on_stack(&wfc.work);
 	return wfc.ret;
 }
-EXPORT_SYMBOL_GPL(work_on_cpu);
+EXPORT_SYMBOL_GPL(work_on_cpu_key);
 
 /**
- * work_on_cpu_safe - run a function in thread context on a particular cpu
+ * work_on_cpu_safe_key - run a function in thread context on a particular cpu
  * @cpu: the cpu to run on
  * @fn:  the function to run
  * @arg: the function argument
+ * @key: The lock class key for lock debugging purposes
  *
  * Disables CPU hotplug and calls work_on_cpu(). The caller must not hold
  * any locks which would prevent @fn from completing.
  *
  * Return: The value @fn returns.
  */
-long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
+long work_on_cpu_safe_key(int cpu, long (*fn)(void *),
+			  void *arg, struct lock_class_key *key)
 {
 	long ret = -ENODEV;
 
 	cpus_read_lock();
 	if (cpu_online(cpu))
-		ret = work_on_cpu(cpu, fn, arg);
+		ret = work_on_cpu_key(cpu, fn, arg, key);
 	cpus_read_unlock();
 	return ret;
 }
-EXPORT_SYMBOL_GPL(work_on_cpu_safe);
+EXPORT_SYMBOL_GPL(work_on_cpu_safe_key);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_FREEZER
-- 
2.42.0
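
The per-callsite key works because each expansion of the work_on_cpu()
macro defines its own static struct lock_class_key inside a GNU C
statement expression. The mechanism itself is easy to demonstrate in
plain userspace C (UNIQUE_ID() is a hypothetical stand-in; compile with
GCC or Clang, which support statement expressions):

	#include <stdio.h>

	/*
	 * Each expansion of UNIQUE_ID() defines its own static int, just
	 * as each work_on_cpu() callsite above now defines its own
	 * static struct lock_class_key.
	 */
	#define UNIQUE_ID()			\
	({					\
		static int __id;		\
		&__id;				\
	})

	int main(void)
	{
		int *a = UNIQUE_ID();	/* one static object for this callsite */
		int *b = UNIQUE_ID();	/* a distinct one for this callsite    */

		printf("per-callsite objects distinct: %s\n",
		       a != b ? "yes" : "no");	/* prints "yes" */
		return 0;
	}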


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH AUTOSEL 6.6 13/13] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
  2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
                   ` (10 preceding siblings ...)
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 12/13] workqueue: Provide one lock class key per work_on_cpu() callsite Sasha Levin
@ 2023-11-06 23:14 ` Sasha Levin
  11 siblings, 0 replies; 15+ messages in thread
From: Sasha Levin @ 2023-11-06 23:14 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Mike Rapoport (IBM), Qi Zheng, Mario Casquero, Ingo Molnar,
	David Hildenbrand, Michal Hocko, Dave Hansen, Rik van Riel,
	Sasha Levin, tglx, mingo, bp, x86, luto, peterz

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

[ Upstream commit a1e2b8b36820d8c91275f207e77e91645b7c6836 ]

Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:

 |  For example, if we use Qemu to start a two NUMA node kernel,
 |  one of the nodes has 2M memory (less than NODE_MIN_SIZE),
 |  and the other node has 2G, then we will encounter the
 |  following panic:
 |
 |    BUG: kernel NULL pointer dereference, address: 0000000000000000
 |    <...>
 |    RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
 |    <...>
 |    Call Trace:
 |      <TASK>
 |      deactivate_slab()
 |      bootstrap()
 |      kmem_cache_init()
 |      start_kernel()
 |      secondary_startup_64_no_verify()

The crashes happen because of an inconsistency between the nodemask,
which marks nodes with less than 4MB as memoryless, and the actual
memory fed into the core mm.

The commit:

  9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")

... that introduced the minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.

Fixes have been submitted to the core MM code to tighten up the
memory topologies it accepts and to not crash on weird input:

  mm: page_alloc: skip memoryless nodes entirely
  mm: memory_hotplug: drop memoryless node from fallback lists

Andrew has accepted them into the -mm tree, but there are no
stable SHA1s yet.

This patch drops the limitation for minimal node size on x86:

  - works around the crash without the fixes to the core MM,
  - makes x86 topologies less weird,
  - removes an arbitrary and undocumented limitation on NUMA topologies.

[ mingo: Improved changelog clarity. ]

Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/include/asm/numa.h | 7 -------
 arch/x86/mm/numa.c          | 7 -------
 2 files changed, 14 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0db..ef2844d691735 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -12,13 +12,6 @@
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES*2)
 
-/*
- * Too small node sizes may confuse the VM badly. Usually they
- * result from BIOS bugs. So dont recognize nodes as standalone
- * NUMA entities that have less than this amount of RAM listed:
- */
-#define NODE_MIN_SIZE (4*1024*1024)
-
 extern int numa_off;
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f2..55e3d895f15c3 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		if (start >= end)
 			continue;
 
-		/*
-		 * Don't confuse VM with a node that doesn't have the
-		 * minimum amount of memory:
-		 */
-		if (end && (end - start) < NODE_MIN_SIZE)
-			continue;
-
 		alloc_node_data(nid);
 	}
 
-- 
2.42.0
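
For reference, the reported topology can be reproduced with a QEMU
invocation along these lines (the flags are an assumption reconstructed
from the quoted report, not copied from it):

	qemu-system-x86_64 -m 2050M -smp 2 \
		-object memory-backend-ram,id=m0,size=2M \
		-object memory-backend-ram,id=m1,size=2048M \
		-numa node,nodeid=0,memdev=m0,cpus=0 \
		-numa node,nodeid=1,memdev=m1,cpus=1 \
		-kernel bzImage -append "console=ttyS0"

Before this patch, node 0's 2M of memory would be silently dropped by
the NODE_MIN_SIZE check while the nodemask still knew about the node,
producing the inconsistency described above.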


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 6.6 05/13] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 05/13] binfmt_elf: Support segments with 0 filesz and misaligned starts Sasha Levin
@ 2023-11-07  0:04   ` Kees Cook
  0 siblings, 0 replies; 15+ messages in thread
From: Kees Cook @ 2023-11-07  0:04 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Eric W. Biederman, Sebastian Ott,
	Thomas Weißschuh, Pedro Falcato, viro, brauner,
	linux-fsdevel, linux-mm

Please drop this from -stable -- it's part of a larger refactoring that
shouldn't be backported without explicit effort/testing.

-Kees

On Mon, Nov 06, 2023 at 06:14:18PM -0500, Sasha Levin wrote:
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> [ Upstream commit 585a018627b4d7ed37387211f667916840b5c5ea ]
> 
> Implement a helper elf_load() that wraps elf_map() and performs all
> of the necessary work to ensure that when "memsz > filesz" the bytes
> described by "memsz > filesz" are zeroed.
> 
> An outstanding issue is if the first segment has filesz 0, and has a
> randomized location. But that is the same as today.
> 
> In this change I replaced an open coded padzero() that did not clear
> all of the way to the end of the page, with padzero() that does.
> 
> I also stopped checking the return of padzero() as there is at least
> one known case where testing for failure is the wrong thing to do.
> It looks like binfmt_elf_fdpic may have the proper set of tests
> for when error handling can be safely completed.
> 
> I found a couple of commits in the old history
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
> that look very interesting in understanding this code.
> 
> commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
> commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
> commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")
> 
> Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
> >  commit 39b56d902bf35241e7cba6cc30b828ed937175ad
> >  Author: Pavel Machek <pavel@ucw.cz>
> >  Date:   Wed Feb 9 22:40:30 2005 -0800
> >
> >     [PATCH] binfmt_elf: clearing bss may fail
> >
> >     So we discover that Borland's Kylix application builder emits weird elf
> >     files which describe a non-writeable bss segment.
> >
> >     So remove the clear_user() check at the place where we zero out the bss.  I
> >     don't _think_ there are any security implications here (plus we've never
> >     checked that clear_user() return value, so whoops if it is a problem).
> >
> >     Signed-off-by: Pavel Machek <pavel@suse.cz>
> >     Signed-off-by: Andrew Morton <akpm@osdl.org>
> >     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for
> non-writable segments and otherwise calling clear_user(), aka padzero(),
> and checking its return code is the right thing to do.
> 
> I just skipped the error checking as that avoids breaking things.
> 
> And notably, it looks like Borland's Kylix died in 2005 so it might be
> safe to just consider read-only segments with memsz > filesz an error.
> 
> Reported-by: Sebastian Ott <sebott@redhat.com>
> Reported-by: Thomas Weißschuh <linux@weissschuh.net>
> Closes: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Link: https://lore.kernel.org/r/87sf71f123.fsf@email.froward.int.ebiederm.org
> Tested-by: Pedro Falcato <pedro.falcato@gmail.com>
> Signed-off-by: Sebastian Ott <sebott@redhat.com>
> Link: https://lore.kernel.org/r/20230929032435.2391507-1-keescook@chromium.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
>  1 file changed, 48 insertions(+), 63 deletions(-)
> 
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d4914073..2a615f476e44e 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
>  
>  #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
>  
> -static int set_brk(unsigned long start, unsigned long end, int prot)
> -{
> -	start = ELF_PAGEALIGN(start);
> -	end = ELF_PAGEALIGN(end);
> -	if (end > start) {
> -		/*
> -		 * Map the last of the bss segment.
> -		 * If the header is requesting these pages to be
> -		 * executable, honour that (ppc32 needs this).
> -		 */
> -		int error = vm_brk_flags(start, end - start,
> -				prot & PROT_EXEC ? VM_EXEC : 0);
> -		if (error)
> -			return error;
> -	}
> -	current->mm->start_brk = current->mm->brk = end;
> -	return 0;
> -}
> -
>  /* We need to explicitly zero any fractional pages
>     after the data section (i.e. bss).  This would
>     contain the junk from the file that should not
> @@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
>  	return(map_addr);
>  }
>  
> +static unsigned long elf_load(struct file *filep, unsigned long addr,
> +		const struct elf_phdr *eppnt, int prot, int type,
> +		unsigned long total_size)
> +{
> +	unsigned long zero_start, zero_end;
> +	unsigned long map_addr;
> +
> +	if (eppnt->p_filesz) {
> +		map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
> +		if (BAD_ADDR(map_addr))
> +			return map_addr;
> +		if (eppnt->p_memsz > eppnt->p_filesz) {
> +			zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
> +				eppnt->p_filesz;
> +			zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
> +				eppnt->p_memsz;
> +
> +			/* Zero the end of the last mapped page */
> +			padzero(zero_start);
> +		}
> +	} else {
> +		map_addr = zero_start = ELF_PAGESTART(addr);
> +		zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
> +			eppnt->p_memsz;
> +	}
> +	if (eppnt->p_memsz > eppnt->p_filesz) {
> +		/*
> +		 * Map the last of the segment.
> +		 * If the header is requesting these pages to be
> +		 * executable, honour that (ppc32 needs this).
> +		 */
> +		int error;
> +
> +		zero_start = ELF_PAGEALIGN(zero_start);
> +		zero_end = ELF_PAGEALIGN(zero_end);
> +
> +		error = vm_brk_flags(zero_start, zero_end - zero_start,
> +				     prot & PROT_EXEC ? VM_EXEC : 0);
> +		if (error)
> +			map_addr = error;
> +	}
> +	return map_addr;
> +}
> +
> +
>  static unsigned long total_mapping_size(const struct elf_phdr *phdr, int nr)
>  {
>  	elf_addr_t min_addr = -1;
> @@ -829,7 +855,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
>  	struct elf_phdr *elf_property_phdata = NULL;
>  	unsigned long elf_bss, elf_brk;
> -	int bss_prot = 0;
>  	int retval, i;
>  	unsigned long elf_entry;
>  	unsigned long e_entry;
> @@ -1040,33 +1065,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  		if (elf_ppnt->p_type != PT_LOAD)
>  			continue;
>  
> -		if (unlikely (elf_brk > elf_bss)) {
> -			unsigned long nbyte;
> -
> -			/* There was a PT_LOAD segment with p_memsz > p_filesz
> -			   before this one. Map anonymous pages, if needed,
> -			   and clear the area.  */
> -			retval = set_brk(elf_bss + load_bias,
> -					 elf_brk + load_bias,
> -					 bss_prot);
> -			if (retval)
> -				goto out_free_dentry;
> -			nbyte = ELF_PAGEOFFSET(elf_bss);
> -			if (nbyte) {
> -				nbyte = ELF_MIN_ALIGN - nbyte;
> -				if (nbyte > elf_brk - elf_bss)
> -					nbyte = elf_brk - elf_bss;
> -				if (clear_user((void __user *)elf_bss +
> -							load_bias, nbyte)) {
> -					/*
> -					 * This bss-zeroing can fail if the ELF
> -					 * file specifies odd protections. So
> -					 * we don't check the return value
> -					 */
> -				}
> -			}
> -		}
> -
>  		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
>  				     !!interpreter, false);
>  
> @@ -1162,7 +1160,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  			}
>  		}
>  
> -		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
> +		error = elf_load(bprm->file, load_bias + vaddr, elf_ppnt,
>  				elf_prot, elf_flags, total_size);
>  		if (BAD_ADDR(error)) {
>  			retval = IS_ERR_VALUE(error) ?
> @@ -1217,10 +1215,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  		if (end_data < k)
>  			end_data = k;
>  		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
> -		if (k > elf_brk) {
> -			bss_prot = elf_prot;
> +		if (k > elf_brk)
>  			elf_brk = k;
> -		}
>  	}
>  
>  	e_entry = elf_ex->e_entry + load_bias;
> @@ -1232,18 +1228,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>  	start_data += load_bias;
>  	end_data += load_bias;
>  
> -	/* Calling set_brk effectively mmaps the pages that we need
> -	 * for the bss and break sections.  We must do this before
> -	 * mapping in the interpreter, to make sure it doesn't wind
> -	 * up getting placed where the bss needs to go.
> -	 */
> -	retval = set_brk(elf_bss, elf_brk, bss_prot);
> -	if (retval)
> -		goto out_free_dentry;
> -	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
> -		retval = -EFAULT; /* Nobody gets to see this, but.. */
> -		goto out_free_dentry;
> -	}
> +	current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
>  
>  	if (interpreter) {
>  		elf_entry = load_elf_interp(interp_elf_ex,
> -- 
> 2.42.0
> 

-- 
Kees Cook
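
To make the zeroing arithmetic in the quoted elf_load() concrete, here
is a userspace sketch with hypothetical program-header values; the
PAGEOFFSET() and PAGEALIGN() macros are local stand-ins for
ELF_PAGEOFFSET()/ELF_PAGEALIGN(), with 4 KiB pages assumed:

	#include <stdio.h>

	#define PAGE_SIZE	4096UL
	#define PAGEOFFSET(v)	((v) & (PAGE_SIZE - 1))
	#define PAGEALIGN(v)	(((v) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

	int main(void)
	{
		/* Hypothetical PT_LOAD segment: file content ends mid-page
		 * and p_memsz extends past it, the classic .bss layout. */
		unsigned long map_addr = 0x400000UL;
		unsigned long p_vaddr  = 0x1234UL;	/* misaligned start */
		unsigned long p_filesz = 0x0800UL;
		unsigned long p_memsz  = 0x3000UL;

		unsigned long zero_start = map_addr + PAGEOFFSET(p_vaddr) + p_filesz;
		unsigned long zero_end   = map_addr + PAGEOFFSET(p_vaddr) + p_memsz;

		/* padzero(zero_start) clears to the end of that page... */
		printf("zero file tail from     0x%lx\n", zero_start);
		/* ...and vm_brk_flags() supplies the remaining whole pages. */
		printf("anon pages span 0x%lx - 0x%lx\n",
		       PAGEALIGN(zero_start), PAGEALIGN(zero_end));
		return 0;
	}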

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH AUTOSEL 6.6 07/13] binfmt_misc: cleanup on filesystem umount
  2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 07/13] binfmt_misc: cleanup on filesystem umount Sasha Levin
@ 2023-11-07  0:04   ` Kees Cook
  0 siblings, 0 replies; 15+ messages in thread
From: Kees Cook @ 2023-11-07  0:04 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Christian Brauner, Sargun Dhillon,
	Serge Hallyn, Jann Horn, Henning Schild, Andrei Vagin, Al Viro,
	Laurent Vivier, linux-fsdevel, Christian Brauner, linux-mm

Please drop this from -stable -- it's part of a larger refactoring that
shouldn't be backported without explicit effort/testing.

-Kees

On Mon, Nov 06, 2023 at 06:14:20PM -0500, Sasha Levin wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
> 
> [ Upstream commit 1c5976ef0f7ad76319df748ccb99a4c7ba2ba464 ]
> 
> Currently, registering a new binary type pins the binfmt_misc
> filesystem. Specifically, this means that as long as there is at least
> one binary type registered the binfmt_misc filesystem survives all
> umounts, i.e. the superblock is not destroyed. Meaning that a umount
> followed by another mount will end up with the same superblock and the
> same binary type handlers. This is a behavior we tend to discourage for
> any new filesystems (apart from a few special filesystems such as
> configfs or debugfs). A umount operation without the filesystem being
> pinned - by e.g. someone holding a file descriptor to an open file -
> should usually result in the destruction of the superblock and all
> associated resources. This makes introspection easier and leads to
> clearly defined, simple and clean semantics. An administrator can rely
> on the fact that a umount will guarantee a clean slate making it
> possible to reinitialize a filesystem. Right now all binary types would
> need to be explicitly deleted before that can happen.
> 
> This allows us to remove the heavy-handed calls to simple_pin_fs() and
> simple_release_fs() when creating and deleting binary types. This in
> turn allows us to replace the current brittle pinning mechanism abusing
> dget() which has caused a range of bugs judging from prior fixes in [2]
> and [3]. The additional dget() in load_misc_binary() pins the dentry but
> only does so for the sake to prevent ->evict_inode() from freeing the
> node when a user removes the binary type and kill_node() is run. Which
> would mean ->interpreter and ->interp_file would be freed causing a UAF.
> 
> This isn't really nicely documented nor is it very clean because it
> relies on simple_pin_fs() pinning the filesystem as long as at least one
> binary type exists. Otherwise it would cause load_misc_binary() to hold
> on to a dentry belonging to a superblock that has been shutdown.
> Replace that implicit pinning with a clean and simple per-node refcount
> and get rid of the ugly dget() pinning. A similar mechanism exists for
> e.g. binderfs (cf. [4]). All the cleanup work can now be done in
> ->evict_inode().
> 
> In a follow-up patch we will make it possible to use binfmt_misc in
> sandboxes. We will use the cleaner semantics where a umount for the
> filesystem will cause the superblock and all resources to be
> deallocated. In preparation for this apply the same semantics to the
> initial binfmt_misc mount. Note that this is a user-visible change and
> as such a uapi change but one that we can reasonably risk. We've
> discussed this in earlier versions of this patchset (cf. [1]).
> 
> The main user and provider of binfmt_misc is systemd. Systemd provides
> binfmt_misc via autofs since it is configurable as a kernel module and
> is used by a few exotic packages and users. As such a binfmt_misc mount
> is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
> provided on demand. Other autofs on demand filesystems include EFI ESP
> which systemd umounts if the mountpoint stays idle for a certain amount
> of time. This doesn't apply to the binfmt_misc autofs mount which isn't
> touched once it is mounted, meaning this change can't accidentally wipe
> binary type handlers without someone having explicitly unmounted
> binfmt_misc. After speaking to systemd folks they don't expect this
> change to affect them.
> 
> In line with our general policy, if we see a regression for systemd or
> other users with this change we will switch back to the old behavior for
> the initial binfmt_misc mount and have binary types pin the filesystem
> again. But while we touch this code let's take the chance and let's
> improve on the status quo.
> 
> [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
> [2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
> [3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
> [4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")
> 
> Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Henning Schild <henning.schild@siemens.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Laurent Vivier <laurent@vivier.eu>
> Cc: linux-fsdevel@vger.kernel.org
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
> /* v2 */
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Add more comments that explain what's going on.
>   - Rename functions while changing them to better reflect what they are
>     doing to make the code easier to understand.
>   - In the first version, when a specific binary type handler was removed
>     through a write to the entry's file, or all binary type handlers were
>     removed by a write to the binfmt_misc mount's status file, all cleanup
>     work happened during inode eviction. That includes removal of the
>     relevant entries from the entry list. While that works fine, I disliked
>     that model after thinking about it for a bit, because it means there
>     was a window where someone had already removed one or all binary
>     handlers but they could still be safely reached from load_misc_binary()
>     when it had managed to take the read_lock() on the entries list while
>     inode eviction was already happening. Again, that is perfectly benign,
>     but it's cleaner to remove the binary handler from the list
>     immediately, meaning that once the write to the entry's file or the
>     binfmt_misc status file returns, the binary type cannot be executed
>     anymore. That gives stronger guarantees to the user.
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  fs/binfmt_misc.c | 216 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 168 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index e0108d17b085c..cf5ed5cd4102d 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -60,12 +60,11 @@ typedef struct {
>  	char *name;
>  	struct dentry *dentry;
>  	struct file *interp_file;
> +	refcount_t users;		/* sync removal with load_misc_binary() */
>  } Node;
>  
>  static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -82,19 +81,23 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> -/*
> - * Check if we support the binfmt
> - * if we do, return the node, else NULL
> - * locking is done in load_misc_binary
> +/**
> + * search_binfmt_handler - search for a binary handler for @bprm
> + * @misc: handle to binfmt_misc instance
> + * @bprm: binary for which we are looking for a handler
> + *
> + * Search for a binary type handler for @bprm in the list of registered binary
> + * type handlers.
> + *
> + * Return: binary type list entry on success, NULL on failure
>   */
> -static Node *check_file(struct linux_binprm *bprm)
> +static Node *search_binfmt_handler(struct linux_binprm *bprm)
>  {
>  	char *p = strrchr(bprm->interp, '.');
> -	struct list_head *l;
> +	Node *e;
>  
>  	/* Walk all the registered handlers. */
> -	list_for_each(l, &entries) {
> -		Node *e = list_entry(l, Node, list);
> +	list_for_each_entry(e, &entries, list) {
>  		char *s;
>  		int j;
>  
> @@ -123,9 +126,49 @@ static Node *check_file(struct linux_binprm *bprm)
>  		if (j == e->size)
>  			return e;
>  	}
> +
>  	return NULL;
>  }
>  
> +/**
> + * get_binfmt_handler - try to find a binary type handler
> + * @misc: handle to binfmt_misc instance
> + * @bprm: binary for which we are looking for a handler
> + *
> + * Try to find a binfmt handler for the binary type. If one is found take a
> + * reference to protect against removal via bm_{entry,status}_write().
> + *
> + * Return: binary type list entry on success, NULL on failure
> + */
> +static Node *get_binfmt_handler(struct linux_binprm *bprm)
> +{
> +	Node *e;
> +
> +	read_lock(&entries_lock);
> +	e = search_binfmt_handler(bprm);
> +	if (e)
> +		refcount_inc(&e->users);
> +	read_unlock(&entries_lock);
> +	return e;
> +}
> +
> +/**
> + * put_binfmt_handler - put binary handler node
> + * @e: node to put
> + *
> + * Free node syncing with load_misc_binary() and defer final free to
> + * load_misc_binary() in case it is using the binary type handler we were
> + * requested to remove.
> + */
> +static void put_binfmt_handler(Node *e)
> +{
> +	if (refcount_dec_and_test(&e->users)) {
> +		if (e->flags & MISC_FMT_OPEN_FILE)
> +			filp_close(e->interp_file, NULL);
> +		kfree(e);
> +	}
> +}
> +
>  /*
>   * the loader itself
>   */
> @@ -139,12 +182,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  	if (!enabled)
>  		return retval;
>  
> -	/* to keep locking time low, we copy the interpreter string */
> -	read_lock(&entries_lock);
> -	fmt = check_file(bprm);
> -	if (fmt)
> -		dget(fmt->dentry);
> -	read_unlock(&entries_lock);
> +	fmt = get_binfmt_handler(bprm);
>  	if (!fmt)
>  		return retval;
>  
> @@ -198,7 +236,16 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  
>  	retval = 0;
>  ret:
> -	dput(fmt->dentry);
> +
> +	/*
> +	 * If we actually put the node here all concurrent calls to
> +	 * load_misc_binary() will have finished. We also know
> +	 * that for the refcount to be zero ->evict_inode() must have removed
> +	 * the node to be deleted from the list. All that is left for us is to
> +	 * close and free.
> +	 */
> +	put_binfmt_handler(fmt);
> +
>  	return retval;
>  }
>  
> @@ -552,30 +599,90 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
>  	return inode;
>  }
>  
> +/**
> + * bm_evict_inode - cleanup data associated with @inode
> + * @inode: inode to which the data is attached
> + *
> + * Cleanup the binary type handler data associated with @inode if a binary type
> + * entry is removed or the filesystem is unmounted and the super block is
> + * shutdown.
> + *
> + * If the ->evict call was not caused by a super block shutdown but by a write
> + * to remove the entry or all entries via bm_{entry,status}_write() the entry
> + * will have already been removed from the list. We keep the list_empty() check
> + * to make that explicit.
> +*/
>  static void bm_evict_inode(struct inode *inode)
>  {
>  	Node *e = inode->i_private;
>  
> -	if (e && e->flags & MISC_FMT_OPEN_FILE)
> -		filp_close(e->interp_file, NULL);
> -
>  	clear_inode(inode);
> -	kfree(e);
> +
> +	if (e) {
> +		write_lock(&entries_lock);
> +		if (!list_empty(&e->list))
> +			list_del_init(&e->list);
> +		write_unlock(&entries_lock);
> +		put_binfmt_handler(e);
> +	}
>  }
>  
> -static void kill_node(Node *e)
> +/**
> + * unlink_binfmt_dentry - remove the dentry for the binary type handler
> + * @dentry: dentry associated with the binary type handler
> + *
> + * Do the actual filesystem work to remove a dentry for a registered binary
> + * type handler. Since binfmt_misc only allows simple files to be created
> + * directly under the root dentry of the filesystem we ensure that we are
> + * indeed passed a dentry directly beneath the root dentry, that the inode
> + * associated with the root dentry is locked, and that it is a regular file we
> + * are asked to remove.
> + */
> +static void unlink_binfmt_dentry(struct dentry *dentry)
>  {
> -	struct dentry *dentry;
> +	struct dentry *parent = dentry->d_parent;
> +	struct inode *inode, *parent_inode;
> +
> +	/* All entries are immediate descendants of the root dentry. */
> +	if (WARN_ON_ONCE(dentry->d_sb->s_root != parent))
> +		return;
>  
> +	/* We only expect to be called on regular files. */
> +	inode = d_inode(dentry);
> +	if (WARN_ON_ONCE(!S_ISREG(inode->i_mode)))
> +		return;
> +
> +	/* The parent inode must be locked. */
> +	parent_inode = d_inode(parent);
> +	if (WARN_ON_ONCE(!inode_is_locked(parent_inode)))
> +		return;
> +
> +	if (simple_positive(dentry)) {
> +		dget(dentry);
> +		simple_unlink(parent_inode, dentry);
> +		d_delete(dentry);
> +		dput(dentry);
> +	}
> +}
> +
> +/**
> + * remove_binfmt_handler - remove a binary type handler
> + * @misc: handle to binfmt_misc instance
> + * @e: binary type handler to remove
> + *
> + * Remove a binary type handler from the list of binary type handlers and
> + * remove its associated dentry. This is called from
> + * binfmt_{entry,status}_write(). In the future, we might want to think about
> + * adding a proper ->unlink() method to binfmt_misc instead of forcing caller's
> + * to use writes to files in order to delete binary type handlers. But it has
> + * worked for so long that it's not a pressing issue.
> + */
> +static void remove_binfmt_handler(Node *e)
> +{
>  	write_lock(&entries_lock);
>  	list_del_init(&e->list);
>  	write_unlock(&entries_lock);
> -
> -	dentry = e->dentry;
> -	drop_nlink(d_inode(dentry));
> -	d_drop(dentry);
> -	dput(dentry);
> -	simple_release_fs(&bm_mnt, &entry_count);
> +	unlink_binfmt_dentry(e->dentry);
>  }
>  
>  /* /<entry> */
> @@ -602,8 +709,8 @@ bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  				size_t count, loff_t *ppos)
>  {
> -	struct dentry *root;
> -	Node *e = file_inode(file)->i_private;
> +	struct inode *inode = file_inode(file);
> +	Node *e = inode->i_private;
>  	int res = parse_command(buffer, count);
>  
>  	switch (res) {
> @@ -617,13 +724,22 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  		break;
>  	case 3:
>  		/* Delete this handler. */
> -		root = file_inode(file)->i_sb->s_root;
> -		inode_lock(d_inode(root));
> +		inode = d_inode(inode->i_sb->s_root);
> +		inode_lock(inode);
>  
> +		/*
> +		 * In order to add new element or remove elements from the list
> +		 * via bm_{entry,register,status}_write() inode_lock() on the
> +		 * root inode must be held.
> +		 * The lock is exclusive ensuring that the list can't be
> +		 * modified. Only load_misc_binary() can access but does so
> +		 * read-only. So we only need to take the write lock when we
> +		 * actually remove the entry from the list.
> +		 */
>  		if (!list_empty(&e->list))
> -			kill_node(e);
> +			remove_binfmt_handler(e);
>  
> -		inode_unlock(d_inode(root));
> +		inode_unlock(inode);
>  		break;
>  	default:
>  		return res;
> @@ -682,13 +798,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	if (!inode)
>  		goto out2;
>  
> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> -	if (err) {
> -		iput(inode);
> -		inode = NULL;
> -		goto out2;
> -	}
> -
> +	refcount_set(&e->users, 1);
>  	e->dentry = dget(dentry);
>  	inode->i_private = e;
>  	inode->i_fop = &bm_entry_operations;
> @@ -732,7 +842,8 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>  		size_t count, loff_t *ppos)
>  {
>  	int res = parse_command(buffer, count);
> -	struct dentry *root;
> +	Node *e, *next;
> +	struct inode *inode;
>  
>  	switch (res) {
>  	case 1:
> @@ -745,13 +856,22 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>  		break;
>  	case 3:
>  		/* Delete all handlers. */
> -		root = file_inode(file)->i_sb->s_root;
> -		inode_lock(d_inode(root));
> +		inode = d_inode(file_inode(file)->i_sb->s_root);
> +		inode_lock(inode);
>  
> -		while (!list_empty(&entries))
> -			kill_node(list_first_entry(&entries, Node, list));
> +		/*
> +		 * In order to add new element or remove elements from the list
> +		 * via bm_{entry,register,status}_write() inode_lock() on the
> +		 * root inode must be held.
> +		 * The lock is exclusive ensuring that the list can't be
> +		 * modified. Only load_misc_binary() can access but does so
> +		 * read-only. So we only need to take the write lock when we
> +		 * actually remove the entry from the list.
> +		 */
> +		list_for_each_entry_safe(e, next, &entries, list)
> +			remove_binfmt_handler(e);
>  
> -		inode_unlock(d_inode(root));
> +		inode_unlock(inode);
>  		break;
>  	default:
>  		return res;
> -- 
> 2.42.0
> 

-- 
Kees Cook
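
The refcounting pattern in the quoted patch, where a user takes a
reference under the read lock and whoever drops the last reference does
the freeing, can be sketched in userspace with C11 atomics (hypothetical
node type, not the kernel code):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdlib.h>

	struct node {
		atomic_int users;
		/* ...binary type payload would live here... */
	};

	static pthread_rwlock_t entries_lock = PTHREAD_RWLOCK_INITIALIZER;
	static struct node *the_node;	/* stands in for the entries list */

	/* Mirrors get_binfmt_handler(): find and ref under the read lock. */
	static struct node *node_get(void)
	{
		struct node *n;

		pthread_rwlock_rdlock(&entries_lock);
		n = the_node;
		if (n)
			atomic_fetch_add(&n->users, 1);
		pthread_rwlock_unlock(&entries_lock);
		return n;
	}

	/* Mirrors put_binfmt_handler(): the last put frees the node. */
	static void node_put(struct node *n)
	{
		if (atomic_fetch_sub(&n->users, 1) == 1)
			free(n);
	}

	int main(void)
	{
		the_node = calloc(1, sizeof(*the_node));
		atomic_init(&the_node->users, 1);	/* the list's reference */

		struct node *n = node_get();	/* a concurrent "exec" user */
		node_put(the_node);	/* removal drops the list reference */
		the_node = NULL;
		node_put(n);		/* last user performs the free      */
		return 0;
	}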

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-11-07  0:04 UTC | newest]

Thread overview: 15+ messages
-- links below jump to the message on this page --
2023-11-06 23:14 [PATCH AUTOSEL 6.6 01/13] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 02/13] rcu: Dump memory object info if callback function is invalid Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 03/13] srcu: Fix srcu_struct node grpmask overflow on 64-bit systems Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 04/13] selftests/lkdtm: Disable CONFIG_UBSAN_TRAP in test config Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 05/13] binfmt_elf: Support segments with 0 filesz and misaligned starts Sasha Levin
2023-11-07  0:04   ` Kees Cook
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 06/13] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 07/13] binfmt_misc: cleanup on filesystem umount Sasha Levin
2023-11-07  0:04   ` Kees Cook
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 08/13] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 09/13] srcu: Only accelerate on enqueue time Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 10/13] smp,csd: Throw an error if a CSD lock is stuck for too long Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 11/13] cpu/hotplug: Don't offline the last non-isolated CPU Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 12/13] workqueue: Provide one lock class key per work_on_cpu() callsite Sasha Levin
2023-11-06 23:14 ` [PATCH AUTOSEL 6.6 13/13] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size Sasha Levin
