From: Luis Chamberlain <mcgrof@kernel.org>
To: David Hildenbrand <david@redhat.com>, Kees Cook <keescook@chromium.org>
Cc: linux-modules@vger.kernel.org, linux-kernel@vger.kernel.org,
pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com,
christophe.leroy@csgroup.eu, song@kernel.org,
torvalds@linux-foundation.org, dave@stgolabs.net,
fan.ni@samsung.com, vincent.fu@samsung.com,
a.manzanares@samsung.com, colin.i.king@gmail.com
Subject: Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
Date: Fri, 24 Mar 2023 10:54:39 -0700 [thread overview]
Message-ID: <ZB3j3x4F2ozYX8UI@bombadil.infradead.org> (raw)
In-Reply-To: <582aa586-e69c-99bb-caf8-eda468c332b6@redhat.com>
On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
> On 21.03.23 20:32, David Hildenbrand wrote:
> > On 20.03.23 22:27, Luis Chamberlain wrote:
> > > On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > > >
> > > > Oh wow, so to clarify, it boots OK?
> > > >
> > >
> > > Now that we know that tree works, I'm curious also now if you can
> > > confirm just re-ordering the patches still works (it should)
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> > >
> >
> > So far no vmap errors booting the debug/kasan kernel (2 tries).
<-- snip -->
> > I think we primarily only care about systemd-udev-settle.service.
> >
> > That is fastest without the rcu patch (~6s), compared to with the rcu
> > patch (~6.5s) and with stock (~7.5s -- 8s).
> >
> > Looks like dracut-initqueue also might be a bit faster with your changes, but
> > maybe it's mostly noise (would have to do more runs).
> >
> > So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> > helpful ...
Yes I confirm the RCU patch does not help at all now also using
stress-ng.
> Are there any other things you would like me to measure/test? I'll have to
> hand back that test machine soonish.
Yes please test the below. Perhaps its not the final form we want, but
it *does* fix OOM'ing when thrashing with stress-ng now with the module
option and even with 100 threads brings down max memory consumption by
259 MiB. The reason is that we also vmalloc during each finit_read_file()
for each module as well way before we even do layout_and_allocate(), and
so obviously if we fix the module path but not this path this will eventually
catch up with us as. I'm not at all happy with the current approach given
ideally we'd bump the counter when the user is done with the file, but we
don't yet have any tracking of that for users, they just vfree the memory
itself. And so this is just trying to catch heavy immediate abuse on the
caller to fend off abuse of vmalloc uses in a lightway manner.
There's gotta be a better way to do this, but its just an idea I have so far.
If we *want* to keep tabs until the user is done, we have to just modify
most users of these APIs and intrudce our own free. I don't think we're
in a rush to fix this so maybe that's the better approach.
And so I've managed to reproduce the issues you found now with my new stress-ng
module stressor as well.
https://github.com/ColinIanKing/stress-ng.git
Even though you have 400 CPUs with stress-ng we can likely reproduce it
with (use a module not loaded on your system):
./stress-ng --module 100 --module-name xfs
Without the patch below using 400 threads still OOMs easily due to the
kread issue. Max threads allowed are 8192.
From ec5099b15bb74f154319034547184e81e4ad8ba0 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 10:35:41 -0700
Subject: [PATCH] fs/kernel_read_file: add a clutch
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/kernel_read_file.c | 52 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 50 insertions(+), 2 deletions(-)
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..2d55abe73b21 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,10 +1,52 @@
// SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "kread: " fmt
+
#include <linux/fs.h>
#include <linux/fs_struct.h>
#include <linux/kernel_read_file.h>
#include <linux/security.h>
#include <linux/vmalloc.h>
+/*
+ * This clutch ensures we only allow a certain number concurrent threads at a
+ * time allocating space concurrently and they must all finish within the
+ * timeout specified. Anything more we know we're thrashing.
+ */
+#define MAX_KREAD_CONCURRENT 20
+static atomic_t kread_concurrent_max = ATOMIC_INIT(MAX_KREAD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(kread_wq);
+
+/*
+ * How many seconds to wait for *all* MAX_KREAD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_KREAD_ALL_BUSY_TIMEOUT 1
+
+static int kernel_read_check_concurrent(void)
+{
+ int ret;
+
+ if (atomic_dec_if_positive(&kread_concurrent_max) < 0) {
+ pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+ atomic_read(&kread_concurrent_max),
+ MAX_KREAD_CONCURRENT);
+ ret = wait_event_killable_timeout(kread_wq,
+ atomic_dec_if_positive(&kread_concurrent_max) >= 0,
+ MAX_KREAD_ALL_BUSY_TIMEOUT * HZ);
+ if (!ret) {
+ pr_warn_ratelimited("reading cannot be processed, kernel busy with %d threads reading files now for more than %d seconds",
+ MAX_KREAD_CONCURRENT, MAX_KREAD_ALL_BUSY_TIMEOUT);
+ return -ETIME;
+ } else if (ret == -ERESTARTSYS) {
+ pr_warn_ratelimited("sigkill sent for kernel read, giving up");
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
/**
* kernel_read_file() - read file contents into a kernel buffer
*
@@ -68,10 +110,14 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
goto out;
}
+ ret = kernel_read_check_concurrent();
+ if (ret)
+ goto out;
+
whole_file = (offset == 0 && i_size <= buf_size);
ret = security_kernel_read_file(file, id, whole_file);
if (ret)
- goto out;
+ goto out_allow_new_read;
if (file_size)
*file_size = i_size;
@@ -117,7 +163,9 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
*buf = NULL;
}
}
-
+out_allow_new_read:
+ atomic_inc(&kread_concurrent_max);
+ wake_up(&kread_wq);
out:
allow_write_access(file);
return ret == 0 ? copied : ret;
--
2.39.2
next prev parent reply other threads:[~2023-03-24 17:55 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-11 5:17 [RFC 00/12] module: avoid userspace pressure on unwanted allocations Luis Chamberlain
2023-03-11 5:17 ` [RFC 01/12] module: use goto errors on check_modinfo() and layout_and_allocate() Luis Chamberlain
2023-03-11 5:17 ` [RFC 02/12] module: move get_modinfo() helpers all above Luis Chamberlain
2023-03-11 5:17 ` [RFC 03/12] module: rename next_string() to module_next_tag_pair() Luis Chamberlain
2023-03-11 5:17 ` [RFC 04/12] module: add a for_each_modinfo_entry() Luis Chamberlain
2023-03-11 5:17 ` [RFC 05/12] module: add debugging alias parsing support Luis Chamberlain
2023-03-11 5:17 ` [RFC 06/12] module: move early sanity checks into a helper Luis Chamberlain
2023-03-11 5:17 ` [RFC 07/12] module: move check_modinfo() early to early_mod_check() Luis Chamberlain
2023-03-11 5:17 ` [RFC 08/12] module: move finished_loading() Luis Chamberlain
2023-03-11 5:17 ` [RFC 09/12] module: extract patient module check into helper Luis Chamberlain
2023-03-11 5:17 ` [RFC 10/12] module: avoid allocation if module is already present and ready Luis Chamberlain
2023-03-11 5:17 ` [RFC 11/12] module: use list_add_tail_rcu() when adding module Luis Chamberlain
2023-03-11 5:17 ` [RFC 12/12] module: use aliases to find module on find_module_all() Luis Chamberlain
2023-03-15 14:43 ` Petr Pavlu
2023-03-15 16:12 ` Luis Chamberlain
2023-03-15 12:24 ` [RFC 00/12] module: avoid userspace pressure on unwanted allocations David Hildenbrand
2023-03-15 16:10 ` Luis Chamberlain
2023-03-15 16:41 ` David Hildenbrand
2023-03-16 23:55 ` Luis Chamberlain
2023-03-16 23:56 ` Luis Chamberlain
2023-03-18 0:11 ` Luis Chamberlain
2023-03-20 9:38 ` David Hildenbrand
2023-03-20 19:40 ` David Hildenbrand
2023-03-20 21:09 ` Luis Chamberlain
2023-03-20 21:15 ` David Hildenbrand
2023-03-20 21:23 ` Luis Chamberlain
2023-03-20 21:27 ` Luis Chamberlain
2023-03-21 19:32 ` David Hildenbrand
2023-03-24 9:27 ` David Hildenbrand
2023-03-24 17:54 ` Luis Chamberlain [this message]
2023-03-24 19:11 ` Linus Torvalds
2023-03-24 19:59 ` Luis Chamberlain
2023-03-24 20:28 ` Linus Torvalds
2023-03-24 21:14 ` Luis Chamberlain
2023-03-24 23:27 ` Luis Chamberlain
2023-03-24 23:41 ` Linus Torvalds
2023-03-28 3:44 ` David Hildenbrand
2023-03-28 6:16 ` Luis Chamberlain
2023-03-28 21:02 ` David Hildenbrand
2023-03-29 5:31 ` Luis Chamberlain
2023-03-30 4:42 ` David Hildenbrand
2023-03-21 15:11 ` David Hildenbrand
2023-03-21 16:52 ` Luis Chamberlain
2023-03-21 17:01 ` David Hildenbrand
2023-03-20 9:37 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZB3j3x4F2ozYX8UI@bombadil.infradead.org \
--to=mcgrof@kernel.org \
--cc=a.manzanares@samsung.com \
--cc=christophe.leroy@csgroup.eu \
--cc=colin.i.king@gmail.com \
--cc=dave@stgolabs.net \
--cc=david@redhat.com \
--cc=fan.ni@samsung.com \
--cc=keescook@chromium.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-modules@vger.kernel.org \
--cc=petr.pavlu@suse.com \
--cc=pmladek@suse.com \
--cc=prarit@redhat.com \
--cc=song@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=vincent.fu@samsung.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).