* [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound
@ 2023-11-06 23:16 Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 2/6] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Sasha Levin @ 2023-11-06 23:16 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Shuai Xue, Peter Zijlstra, Ingo Molnar, Sasha Levin, mingo, acme,
linux-perf-users
From: Shuai Xue <xueshuai@linux.alibaba.com>
[ Upstream commit 54aee5f15b83437f23b2b2469bcf21bdd9823916 ]
When perf-record with a large AUX area, e.g 4GB, it fails with:
#perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
failed to mmap with 12 (Cannot allocate memory)
and it reveals a WARNING with __alloc_pages():
------------[ cut here ]------------
WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248
Call trace:
__alloc_pages+0x1ec/0x248
__kmalloc_large_node+0xc0/0x1f8
__kmalloc_node+0x134/0x1e8
rb_alloc_aux+0xe0/0x298
perf_mmap+0x440/0x660
mmap_region+0x308/0x8a8
do_mmap+0x3c0/0x528
vm_mmap_pgoff+0xf4/0x1b8
ksys_mmap_pgoff+0x18c/0x218
__arm64_sys_mmap+0x38/0x58
invoke_syscall+0x50/0x128
el0_svc_common.constprop.0+0x58/0x188
do_el0_svc+0x34/0x50
el0_svc+0x34/0x108
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x1a4/0x1a8
'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to
maintains AUX trace pages. The allocated page for this array is physically
contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the
size of pointer array crosses the limitation set by MAX_ORDER, it reveals a
WARNING.
So bail out early with -ENOMEM if the request AUX area is out of bound,
e.g.:
#perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
failed to mmap with 12 (Cannot allocate memory)
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
kernel/events/ring_buffer.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index f40da32f5e753..6808873555f0d 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -696,6 +696,12 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
watermark = 0;
}
+ /*
+ * kcalloc_node() is unable to allocate buffer if the size is larger
+ * than: PAGE_SIZE << MAX_ORDER; directly bail out in this case.
+ */
+ if (get_order((unsigned long)nr_pages * sizeof(void *)) > MAX_ORDER)
+ return -ENOMEM;
rb->aux_pages = kcalloc_node(nr_pages, sizeof(void *), GFP_KERNEL,
node);
if (!rb->aux_pages)
--
2.42.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH AUTOSEL 5.15 2/6] clocksource/drivers/timer-imx-gpt: Fix potential memory leak
2023-11-06 23:16 [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
@ 2023-11-06 23:16 ` Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 3/6] binfmt_misc: cleanup on filesystem umount Sasha Levin
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2023-11-06 23:16 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Jacky Bai, Peng Fan, Daniel Lezcano, Sasha Levin, tglx, shawnguo,
linux-arm-kernel
From: Jacky Bai <ping.bai@nxp.com>
[ Upstream commit 8051a993ce222a5158bccc6ac22ace9253dd71cb ]
Fix coverity Issue CID 250382: Resource leak (RESOURCE_LEAK).
Add kfree when error return.
Signed-off-by: Jacky Bai <ping.bai@nxp.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231009083922.1942971-1-ping.bai@nxp.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/clocksource/timer-imx-gpt.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/clocksource/timer-imx-gpt.c b/drivers/clocksource/timer-imx-gpt.c
index 7b2c70f2f353b..fabff69e52e58 100644
--- a/drivers/clocksource/timer-imx-gpt.c
+++ b/drivers/clocksource/timer-imx-gpt.c
@@ -454,12 +454,16 @@ static int __init mxc_timer_init_dt(struct device_node *np, enum imx_gpt_type t
return -ENOMEM;
imxtm->base = of_iomap(np, 0);
- if (!imxtm->base)
- return -ENXIO;
+ if (!imxtm->base) {
+ ret = -ENXIO;
+ goto err_kfree;
+ }
imxtm->irq = irq_of_parse_and_map(np, 0);
- if (imxtm->irq <= 0)
- return -EINVAL;
+ if (imxtm->irq <= 0) {
+ ret = -EINVAL;
+ goto err_kfree;
+ }
imxtm->clk_ipg = of_clk_get_by_name(np, "ipg");
@@ -472,11 +476,15 @@ static int __init mxc_timer_init_dt(struct device_node *np, enum imx_gpt_type t
ret = _mxc_timer_init(imxtm);
if (ret)
- return ret;
+ goto err_kfree;
initialized = 1;
return 0;
+
+err_kfree:
+ kfree(imxtm);
+ return ret;
}
static int __init imx1_timer_init_dt(struct device_node *np)
--
2.42.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH AUTOSEL 5.15 3/6] binfmt_misc: cleanup on filesystem umount
2023-11-06 23:16 [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 2/6] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
@ 2023-11-06 23:16 ` Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 4/6] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware Sasha Levin
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2023-11-06 23:16 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Christian Brauner, Sargun Dhillon, Serge Hallyn, Jann Horn,
Henning Schild, Andrei Vagin, Al Viro, Laurent Vivier,
linux-fsdevel, Christian Brauner, Kees Cook, Sasha Levin,
linux-mm
From: Christian Brauner <christian.brauner@ubuntu.com>
[ Upstream commit 1c5976ef0f7ad76319df748ccb99a4c7ba2ba464 ]
Currently, registering a new binary type pins the binfmt_misc
filesystem. Specifically, this means that as long as there is at least
one binary type registered the binfmt_misc filesystem survives all
umounts, i.e. the superblock is not destroyed. Meaning that a umount
followed by another mount will end up with the same superblock and the
same binary type handlers. This is a behavior we tend to discourage for
any new filesystems (apart from a few special filesystems such as e.g.
configfs or debugfs). A umount operation without the filesystem being
pinned - by e.g. someone holding a file descriptor to an open file -
should usually result in the destruction of the superblock and all
associated resources. This makes introspection easier and leads to
clearly defined, simple and clean semantics. An administrator can rely
on the fact that a umount will guarantee a clean slate making it
possible to reinitialize a filesystem. Right now all binary types would
need to be explicitly deleted before that can happen.
This allows us to remove the heavy-handed calls to simple_pin_fs() and
simple_release_fs() when creating and deleting binary types. This in
turn allows us to replace the current brittle pinning mechanism abusing
dget() which has caused a range of bugs judging from prior fixes in [2]
and [3]. The additional dget() in load_misc_binary() pins the dentry but
only does so for the sake to prevent ->evict_inode() from freeing the
node when a user removes the binary type and kill_node() is run. Which
would mean ->interpreter and ->interp_file would be freed causing a UAF.
This isn't really nicely documented nor is it very clean because it
relies on simple_pin_fs() pinning the filesystem as long as at least one
binary type exists. Otherwise it would cause load_misc_binary() to hold
on to a dentry belonging to a superblock that has been shutdown.
Replace that implicit pinning with a clean and simple per-node refcount
and get rid of the ugly dget() pinning. A similar mechanism exists for
e.g. binderfs (cf. [4]). All the cleanup work can now be done in
->evict_inode().
In a follow-up patch we will make it possible to use binfmt_misc in
sandboxes. We will use the cleaner semantics where a umount for the
filesystem will cause the superblock and all resources to be
deallocated. In preparation for this apply the same semantics to the
initial binfmt_misc mount. Note, that this is a user-visible change and
as such a uapi change but one that we can reasonably risk. We've
discussed this in earlier versions of this patchset (cf. [1]).
The main user and provider of binfmt_misc is systemd. Systemd provides
binfmt_misc via autofs since it is configurable as a kernel module and
is used by a few exotic packages and users. As such a binfmt_misc mount
is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
provided on demand. Other autofs on demand filesystems include EFI ESP
which systemd umounts if the mountpoint stays idle for a certain amount
of time. This doesn't apply to the binfmt_misc autofs mount which isn't
touched once it is mounted meaning this change can't accidently wipe
binary type handlers without someone having explicitly unmounted
binfmt_misc. After speaking to systemd folks they don't expect this
change to affect them.
In line with our general policy, if we see a regression for systemd or
other users with this change we will switch back to the old behavior for
the initial binfmt_misc mount and have binary types pin the filesystem
again. But while we touch this code let's take the chance and let's
improve on the status quo.
[1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
[2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
[3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
[4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")
Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Henning Schild <henning.schild@siemens.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Laurent Vivier <laurent@vivier.eu>
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
- Add more comments that explain what's going on.
- Rename functions while changing them to better reflect what they are
doing to make the code easier to understand.
- In the first version when a specific binary type handler was removed
either through a write to the entry's file or all binary type
handlers were removed by a write to the binfmt_misc mount's status
file all cleanup work happened during inode eviction.
That includes removal of the relevant entries from entry list. While
that works fine I disliked that model after thinking about it for a
bit. Because it means that there was a window were someone has
already removed a or all binary handlers but they could still be
safely reached from load_misc_binary() when it has managed to take
the read_lock() on the entries list while inode eviction was already
happening. Again, that perfectly benign but it's cleaner to remove
the binary handler from the list immediately meaning that ones the
write to then entry's file or the binfmt_misc status file returns
the binary type cannot be executed anymore. That gives stronger
guarantees to the user.
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/binfmt_misc.c | 216 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 168 insertions(+), 48 deletions(-)
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index bb202ad369d53..740dac1012ae8 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -60,12 +60,11 @@ typedef struct {
char *name;
struct dentry *dentry;
struct file *interp_file;
+ refcount_t users; /* sync removal with load_misc_binary() */
} Node;
static DEFINE_RWLOCK(entries_lock);
static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;
/*
* Max length of the register string. Determined by:
@@ -82,19 +81,23 @@ static int entry_count;
*/
#define MAX_REGISTER_LENGTH 1920
-/*
- * Check if we support the binfmt
- * if we do, return the node, else NULL
- * locking is done in load_misc_binary
+/**
+ * search_binfmt_handler - search for a binary handler for @bprm
+ * @misc: handle to binfmt_misc instance
+ * @bprm: binary for which we are looking for a handler
+ *
+ * Search for a binary type handler for @bprm in the list of registered binary
+ * type handlers.
+ *
+ * Return: binary type list entry on success, NULL on failure
*/
-static Node *check_file(struct linux_binprm *bprm)
+static Node *search_binfmt_handler(struct linux_binprm *bprm)
{
char *p = strrchr(bprm->interp, '.');
- struct list_head *l;
+ Node *e;
/* Walk all the registered handlers. */
- list_for_each(l, &entries) {
- Node *e = list_entry(l, Node, list);
+ list_for_each_entry(e, &entries, list) {
char *s;
int j;
@@ -123,9 +126,49 @@ static Node *check_file(struct linux_binprm *bprm)
if (j == e->size)
return e;
}
+
return NULL;
}
+/**
+ * get_binfmt_handler - try to find a binary type handler
+ * @misc: handle to binfmt_misc instance
+ * @bprm: binary for which we are looking for a handler
+ *
+ * Try to find a binfmt handler for the binary type. If one is found take a
+ * reference to protect against removal via bm_{entry,status}_write().
+ *
+ * Return: binary type list entry on success, NULL on failure
+ */
+static Node *get_binfmt_handler(struct linux_binprm *bprm)
+{
+ Node *e;
+
+ read_lock(&entries_lock);
+ e = search_binfmt_handler(bprm);
+ if (e)
+ refcount_inc(&e->users);
+ read_unlock(&entries_lock);
+ return e;
+}
+
+/**
+ * put_binfmt_handler - put binary handler node
+ * @e: node to put
+ *
+ * Free node syncing with load_misc_binary() and defer final free to
+ * load_misc_binary() in case it is using the binary type handler we were
+ * requested to remove.
+ */
+static void put_binfmt_handler(Node *e)
+{
+ if (refcount_dec_and_test(&e->users)) {
+ if (e->flags & MISC_FMT_OPEN_FILE)
+ filp_close(e->interp_file, NULL);
+ kfree(e);
+ }
+}
+
/*
* the loader itself
*/
@@ -139,12 +182,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
if (!enabled)
return retval;
- /* to keep locking time low, we copy the interpreter string */
- read_lock(&entries_lock);
- fmt = check_file(bprm);
- if (fmt)
- dget(fmt->dentry);
- read_unlock(&entries_lock);
+ fmt = get_binfmt_handler(bprm);
if (!fmt)
return retval;
@@ -198,7 +236,16 @@ static int load_misc_binary(struct linux_binprm *bprm)
retval = 0;
ret:
- dput(fmt->dentry);
+
+ /*
+ * If we actually put the node here all concurrent calls to
+ * load_misc_binary() will have finished. We also know
+ * that for the refcount to be zero ->evict_inode() must have removed
+ * the node to be deleted from the list. All that is left for us is to
+ * close and free.
+ */
+ put_binfmt_handler(fmt);
+
return retval;
}
@@ -553,30 +600,90 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
return inode;
}
+/**
+ * bm_evict_inode - cleanup data associated with @inode
+ * @inode: inode to which the data is attached
+ *
+ * Cleanup the binary type handler data associated with @inode if a binary type
+ * entry is removed or the filesystem is unmounted and the super block is
+ * shutdown.
+ *
+ * If the ->evict call was not caused by a super block shutdown but by a write
+ * to remove the entry or all entries via bm_{entry,status}_write() the entry
+ * will have already been removed from the list. We keep the list_empty() check
+ * to make that explicit.
+*/
static void bm_evict_inode(struct inode *inode)
{
Node *e = inode->i_private;
- if (e && e->flags & MISC_FMT_OPEN_FILE)
- filp_close(e->interp_file, NULL);
-
clear_inode(inode);
- kfree(e);
+
+ if (e) {
+ write_lock(&entries_lock);
+ if (!list_empty(&e->list))
+ list_del_init(&e->list);
+ write_unlock(&entries_lock);
+ put_binfmt_handler(e);
+ }
}
-static void kill_node(Node *e)
+/**
+ * unlink_binfmt_dentry - remove the dentry for the binary type handler
+ * @dentry: dentry associated with the binary type handler
+ *
+ * Do the actual filesystem work to remove a dentry for a registered binary
+ * type handler. Since binfmt_misc only allows simple files to be created
+ * directly under the root dentry of the filesystem we ensure that we are
+ * indeed passed a dentry directly beneath the root dentry, that the inode
+ * associated with the root dentry is locked, and that it is a regular file we
+ * are asked to remove.
+ */
+static void unlink_binfmt_dentry(struct dentry *dentry)
{
- struct dentry *dentry;
+ struct dentry *parent = dentry->d_parent;
+ struct inode *inode, *parent_inode;
+
+ /* All entries are immediate descendants of the root dentry. */
+ if (WARN_ON_ONCE(dentry->d_sb->s_root != parent))
+ return;
+ /* We only expect to be called on regular files. */
+ inode = d_inode(dentry);
+ if (WARN_ON_ONCE(!S_ISREG(inode->i_mode)))
+ return;
+
+ /* The parent inode must be locked. */
+ parent_inode = d_inode(parent);
+ if (WARN_ON_ONCE(!inode_is_locked(parent_inode)))
+ return;
+
+ if (simple_positive(dentry)) {
+ dget(dentry);
+ simple_unlink(parent_inode, dentry);
+ d_delete(dentry);
+ dput(dentry);
+ }
+}
+
+/**
+ * remove_binfmt_handler - remove a binary type handler
+ * @misc: handle to binfmt_misc instance
+ * @e: binary type handler to remove
+ *
+ * Remove a binary type handler from the list of binary type handlers and
+ * remove its associated dentry. This is called from
+ * binfmt_{entry,status}_write(). In the future, we might want to think about
+ * adding a proper ->unlink() method to binfmt_misc instead of forcing caller's
+ * to use writes to files in order to delete binary type handlers. But it has
+ * worked for so long that it's not a pressing issue.
+ */
+static void remove_binfmt_handler(Node *e)
+{
write_lock(&entries_lock);
list_del_init(&e->list);
write_unlock(&entries_lock);
-
- dentry = e->dentry;
- drop_nlink(d_inode(dentry));
- d_drop(dentry);
- dput(dentry);
- simple_release_fs(&bm_mnt, &entry_count);
+ unlink_binfmt_dentry(e->dentry);
}
/* /<entry> */
@@ -603,8 +710,8 @@ bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
- struct dentry *root;
- Node *e = file_inode(file)->i_private;
+ struct inode *inode = file_inode(file);
+ Node *e = inode->i_private;
int res = parse_command(buffer, count);
switch (res) {
@@ -618,13 +725,22 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
break;
case 3:
/* Delete this handler. */
- root = file_inode(file)->i_sb->s_root;
- inode_lock(d_inode(root));
+ inode = d_inode(inode->i_sb->s_root);
+ inode_lock(inode);
+ /*
+ * In order to add new element or remove elements from the list
+ * via bm_{entry,register,status}_write() inode_lock() on the
+ * root inode must be held.
+ * The lock is exclusive ensuring that the list can't be
+ * modified. Only load_misc_binary() can access but does so
+ * read-only. So we only need to take the write lock when we
+ * actually remove the entry from the list.
+ */
if (!list_empty(&e->list))
- kill_node(e);
+ remove_binfmt_handler(e);
- inode_unlock(d_inode(root));
+ inode_unlock(inode);
break;
default:
return res;
@@ -683,13 +799,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
if (!inode)
goto out2;
- err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
- if (err) {
- iput(inode);
- inode = NULL;
- goto out2;
- }
-
+ refcount_set(&e->users, 1);
e->dentry = dget(dentry);
inode->i_private = e;
inode->i_fop = &bm_entry_operations;
@@ -733,7 +843,8 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
int res = parse_command(buffer, count);
- struct dentry *root;
+ Node *e, *next;
+ struct inode *inode;
switch (res) {
case 1:
@@ -746,13 +857,22 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
break;
case 3:
/* Delete all handlers. */
- root = file_inode(file)->i_sb->s_root;
- inode_lock(d_inode(root));
+ inode = d_inode(file_inode(file)->i_sb->s_root);
+ inode_lock(inode);
- while (!list_empty(&entries))
- kill_node(list_first_entry(&entries, Node, list));
+ /*
+ * In order to add new element or remove elements from the list
+ * via bm_{entry,register,status}_write() inode_lock() on the
+ * root inode must be held.
+ * The lock is exclusive ensuring that the list can't be
+ * modified. Only load_misc_binary() can access but does so
+ * read-only. So we only need to take the write lock when we
+ * actually remove the entry from the list.
+ */
+ list_for_each_entry_safe(e, next, &entries, list)
+ remove_binfmt_handler(e);
- inode_unlock(d_inode(root));
+ inode_unlock(inode);
break;
default:
return res;
--
2.42.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH AUTOSEL 5.15 4/6] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware
2023-11-06 23:16 [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 2/6] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 3/6] binfmt_misc: cleanup on filesystem umount Sasha Levin
@ 2023-11-06 23:16 ` Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 5/6] workqueue: Provide one lock class key per work_on_cpu() callsite Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 6/6] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size Sasha Levin
4 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2023-11-06 23:16 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Ronald Wahl, Alexandre Belloni, Daniel Lezcano, Sasha Levin, tglx,
nicolas.ferre, claudiu.beznea, linux-arm-kernel
From: Ronald Wahl <ronald.wahl@raritan.com>
[ Upstream commit 6d3bc4c02d59996d1d3180d8ed409a9d7d5900e0 ]
On SAM9 hardware two cascaded 16 bit timers are used to form a 32 bit
high resolution timer that is used as scheduler clock when the kernel
has been configured that way (CONFIG_ATMEL_CLOCKSOURCE_TCB).
The driver initially triggers a reset-to-zero of the two timers but this
reset is only performed on the next rising clock. For the first timer
this is ok - it will be in the next 60ns (16MHz clock). For the chained
second timer this will only happen after the first timer overflows, i.e.
after 2^16 clocks (~4ms with a 16MHz clock). So with other words the
scheduler clock resets to 0 after the first 2^16 clock cycles.
It looks like that the scheduler does not like this and behaves wrongly
over its lifetime, e.g. some tasks are scheduled with a long delay. Why
that is and if there are additional requirements for this behaviour has
not been further analysed.
There is a simple fix for resetting the second timer as well when the
first timer is reset and this is to set the ATMEL_TC_ASWTRG_SET bit in
the Channel Mode register (CMR) of the first timer. This will also rise
the TIOA line (clock input of the second timer) when a software trigger
respective SYNC is issued.
Signed-off-by: Ronald Wahl <ronald.wahl@raritan.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231007161803.31342-1-rwahl@gmx.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/clocksource/timer-atmel-tcb.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/clocksource/timer-atmel-tcb.c b/drivers/clocksource/timer-atmel-tcb.c
index 27af17c995900..2a90c92a9182a 100644
--- a/drivers/clocksource/timer-atmel-tcb.c
+++ b/drivers/clocksource/timer-atmel-tcb.c
@@ -315,6 +315,7 @@ static void __init tcb_setup_dual_chan(struct atmel_tc *tc, int mck_divisor_idx)
writel(mck_divisor_idx /* likely divide-by-8 */
| ATMEL_TC_WAVE
| ATMEL_TC_WAVESEL_UP /* free-run */
+ | ATMEL_TC_ASWTRG_SET /* TIOA0 rises at software trigger */
| ATMEL_TC_ACPA_SET /* TIOA0 rises at 0 */
| ATMEL_TC_ACPC_CLEAR, /* (duty cycle 50%) */
tcaddr + ATMEL_TC_REG(0, CMR));
--
2.42.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH AUTOSEL 5.15 5/6] workqueue: Provide one lock class key per work_on_cpu() callsite
2023-11-06 23:16 [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
` (2 preceding siblings ...)
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 4/6] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware Sasha Levin
@ 2023-11-06 23:16 ` Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 6/6] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size Sasha Levin
4 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2023-11-06 23:16 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Frederic Weisbecker, Paul E . McKenney, Tejun Heo, Sasha Levin
From: Frederic Weisbecker <frederic@kernel.org>
[ Upstream commit 265f3ed077036f053981f5eea0b5b43e7c5b39ff ]
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result the workqueue related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK
Fix this with providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/linux/workqueue.h | 46 +++++++++++++++++++++++++++++++++------
kernel/workqueue.c | 20 ++++++++++-------
2 files changed, 51 insertions(+), 15 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 20a47eb94b0f3..1e96680f50230 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -222,18 +222,16 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
* to generate better code.
*/
#ifdef CONFIG_LOCKDEP
-#define __INIT_WORK(_work, _func, _onstack) \
+#define __INIT_WORK_KEY(_work, _func, _onstack, _key) \
do { \
- static struct lock_class_key __key; \
- \
__init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
- lockdep_init_map(&(_work)->lockdep_map, "(work_completion)"#_work, &__key, 0); \
+ lockdep_init_map(&(_work)->lockdep_map, "(work_completion)"#_work, (_key), 0); \
INIT_LIST_HEAD(&(_work)->entry); \
(_work)->func = (_func); \
} while (0)
#else
-#define __INIT_WORK(_work, _func, _onstack) \
+#define __INIT_WORK_KEY(_work, _func, _onstack, _key) \
do { \
__init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
@@ -242,12 +240,22 @@ static inline unsigned int work_static(struct work_struct *work) { return 0; }
} while (0)
#endif
+#define __INIT_WORK(_work, _func, _onstack) \
+ do { \
+ static __maybe_unused struct lock_class_key __key; \
+ \
+ __INIT_WORK_KEY(_work, _func, _onstack, &__key); \
+ } while (0)
+
#define INIT_WORK(_work, _func) \
__INIT_WORK((_work), (_func), 0)
#define INIT_WORK_ONSTACK(_work, _func) \
__INIT_WORK((_work), (_func), 1)
+#define INIT_WORK_ONSTACK_KEY(_work, _func, _key) \
+ __INIT_WORK_KEY((_work), (_func), 1, _key)
+
#define __INIT_DELAYED_WORK(_work, _func, _tflags) \
do { \
INIT_WORK(&(_work)->work, (_func)); \
@@ -632,8 +640,32 @@ static inline long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
return fn(arg);
}
#else
-long work_on_cpu(int cpu, long (*fn)(void *), void *arg);
-long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg);
+long work_on_cpu_key(int cpu, long (*fn)(void *),
+ void *arg, struct lock_class_key *key);
+/*
+ * A new key is defined for each caller to make sure the work
+ * associated with the function doesn't share its locking class.
+ */
+#define work_on_cpu(_cpu, _fn, _arg) \
+({ \
+ static struct lock_class_key __key; \
+ \
+ work_on_cpu_key(_cpu, _fn, _arg, &__key); \
+})
+
+long work_on_cpu_safe_key(int cpu, long (*fn)(void *),
+ void *arg, struct lock_class_key *key);
+
+/*
+ * A new key is defined for each caller to make sure the work
+ * associated with the function doesn't share its locking class.
+ */
+#define work_on_cpu_safe(_cpu, _fn, _arg) \
+({ \
+ static struct lock_class_key __key; \
+ \
+ work_on_cpu_safe_key(_cpu, _fn, _arg, &__key); \
+})
#endif /* CONFIG_SMP */
#ifdef CONFIG_FREEZER
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 19868cf588779..962ee27ec7d70 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5209,50 +5209,54 @@ static void work_for_cpu_fn(struct work_struct *work)
}
/**
- * work_on_cpu - run a function in thread context on a particular cpu
+ * work_on_cpu_key - run a function in thread context on a particular cpu
* @cpu: the cpu to run on
* @fn: the function to run
* @arg: the function arg
+ * @key: The lock class key for lock debugging purposes
*
* It is up to the caller to ensure that the cpu doesn't go offline.
* The caller must not hold any locks which would prevent @fn from completing.
*
* Return: The value @fn returns.
*/
-long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
+long work_on_cpu_key(int cpu, long (*fn)(void *),
+ void *arg, struct lock_class_key *key)
{
struct work_for_cpu wfc = { .fn = fn, .arg = arg };
- INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
+ INIT_WORK_ONSTACK_KEY(&wfc.work, work_for_cpu_fn, key);
schedule_work_on(cpu, &wfc.work);
flush_work(&wfc.work);
destroy_work_on_stack(&wfc.work);
return wfc.ret;
}
-EXPORT_SYMBOL_GPL(work_on_cpu);
+EXPORT_SYMBOL_GPL(work_on_cpu_key);
/**
- * work_on_cpu_safe - run a function in thread context on a particular cpu
+ * work_on_cpu_safe_key - run a function in thread context on a particular cpu
* @cpu: the cpu to run on
* @fn: the function to run
* @arg: the function argument
+ * @key: The lock class key for lock debugging purposes
*
* Disables CPU hotplug and calls work_on_cpu(). The caller must not hold
* any locks which would prevent @fn from completing.
*
* Return: The value @fn returns.
*/
-long work_on_cpu_safe(int cpu, long (*fn)(void *), void *arg)
+long work_on_cpu_safe_key(int cpu, long (*fn)(void *),
+ void *arg, struct lock_class_key *key)
{
long ret = -ENODEV;
cpus_read_lock();
if (cpu_online(cpu))
- ret = work_on_cpu(cpu, fn, arg);
+ ret = work_on_cpu_key(cpu, fn, arg, key);
cpus_read_unlock();
return ret;
}
-EXPORT_SYMBOL_GPL(work_on_cpu_safe);
+EXPORT_SYMBOL_GPL(work_on_cpu_safe_key);
#endif /* CONFIG_SMP */
#ifdef CONFIG_FREEZER
--
2.42.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH AUTOSEL 5.15 6/6] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
2023-11-06 23:16 [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
` (3 preceding siblings ...)
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 5/6] workqueue: Provide one lock class key per work_on_cpu() callsite Sasha Levin
@ 2023-11-06 23:16 ` Sasha Levin
4 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2023-11-06 23:16 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Mike Rapoport (IBM), Qi Zheng, Mario Casquero, Ingo Molnar,
David Hildenbrand, Michal Hocko, Dave Hansen, Rik van Riel,
Sasha Levin, tglx, mingo, bp, x86, luto, peterz
From: "Mike Rapoport (IBM)" <rppt@kernel.org>
[ Upstream commit a1e2b8b36820d8c91275f207e77e91645b7c6836 ]
Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:
| For example, if we use Qemu to start a two NUMA node kernel,
| one of the nodes has 2M memory (less than NODE_MIN_SIZE),
| and the other node has 2G, then we will encounter the
| following panic:
|
| BUG: kernel NULL pointer dereference, address: 0000000000000000
| <...>
| RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
| <...>
| Call Trace:
| <TASK>
| deactivate_slab()
| bootstrap()
| kmem_cache_init()
| start_kernel()
| secondary_startup_64_no_verify()
The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.
The commit:
9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")
... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.
Fixes have been submitted to the core MM code to tighten up the
memory topologies it accepts and to not crash on weird input:
mm: page_alloc: skip memoryless nodes entirely
mm: memory_hotplug: drop memoryless node from fallback lists
Andrew has accepted them into the -mm tree, but there are no
stable SHA1's yet.
This patch drops the limitation for minimal node size on x86:
- which works around the crash without the fixes to the core MM.
- makes x86 topologies less weird,
- removes an arbitrary and undocumented limitation on NUMA topologies.
[ mingo: Improved changelog clarity. ]
Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/include/asm/numa.h | 7 -------
arch/x86/mm/numa.c | 7 -------
2 files changed, 14 deletions(-)
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0db..ef2844d691735 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -12,13 +12,6 @@
#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
-/*
- * Too small node sizes may confuse the VM badly. Usually they
- * result from BIOS bugs. So dont recognize nodes as standalone
- * NUMA entities that have less than this amount of RAM listed:
- */
-#define NODE_MIN_SIZE (4*1024*1024)
-
extern int numa_off;
/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e360c6892a584..1a1c0c242f272 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -601,13 +601,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (start >= end)
continue;
- /*
- * Don't confuse VM with a node that doesn't have the
- * minimum amount of memory:
- */
- if (end && (end - start) < NODE_MIN_SIZE)
- continue;
-
alloc_node_data(nid);
}
--
2.42.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2023-11-06 23:19 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-06 23:16 [PATCH AUTOSEL 5.15 1/6] perf/core: Bail out early if the request AUX area is out of bound Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 2/6] clocksource/drivers/timer-imx-gpt: Fix potential memory leak Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 3/6] binfmt_misc: cleanup on filesystem umount Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 4/6] clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 5/6] workqueue: Provide one lock class key per work_on_cpu() callsite Sasha Levin
2023-11-06 23:16 ` [PATCH AUTOSEL 5.15 6/6] x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size Sasha Levin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).