From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 55A86C761AF for ; Wed, 29 Mar 2023 05:46:30 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B76346B0075; Wed, 29 Mar 2023 01:46:29 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AFE4C6B0078; Wed, 29 Mar 2023 01:46:29 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 977B6900002; Wed, 29 Mar 2023 01:46:29 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 810CD6B0075 for ; Wed, 29 Mar 2023 01:46:29 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 52670140B7C for ; Wed, 29 Mar 2023 05:46:29 +0000 (UTC) X-FDA: 80620850898.21.539599E Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf11.hostedemail.com (Postfix) with ESMTP id 6B12E40010 for ; Wed, 29 Mar 2023 05:46:27 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=linuxfoundation.org header.s=korg header.b=YagHrNkk; spf=pass (imf11.hostedemail.com: domain of gregkh@linuxfoundation.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=gregkh@linuxfoundation.org; dmarc=pass (policy=none) header.from=linuxfoundation.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1680068787; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=G/REpJAO8VXsGttTfIdfeXKcKFrz3dXJWQGj5MIUbWE=; b=W7QUwA2pQcM1fO0Avb9CVgSFJFQNGgrUrSHt5UyIInb2K9a/QcCtUwnsWUjB0NtUsibDNn ts2XCakVgCWk4kw0ts9RopcxLvMnEDf6lMBOnuJFRYgRqFGox07W2k4jC72fTIOVJcIVt8 5WFnevulZSYOz1jbi/hO8ZFZXOl6bFA= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=linuxfoundation.org header.s=korg header.b=YagHrNkk; spf=pass (imf11.hostedemail.com: domain of gregkh@linuxfoundation.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=gregkh@linuxfoundation.org; dmarc=pass (policy=none) header.from=linuxfoundation.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1680068787; a=rsa-sha256; cv=none; b=xpH9CjCVIMOo98u5Q3ViNNpS0QsN4qwp4V4jWeKNvQvYQ4abbVqshxlszIoJK3mD9j46Yy zZV37++/P2cb5waMuc2G131MSbw9OFi/PJQY9ffNSzbU5b6BPP9eNZipD/PJqksjwYtPCg UpJk6r+/uh7MbgNsH96Vw9aw2zIHxO8= Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id DB7C7B819D1; Wed, 29 Mar 2023 05:46:25 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1D43EC433D2; Wed, 29 Mar 2023 05:46:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1680068784; bh=8fFU5kQOTeWdxd4ArzXKxbPdX+xWPqKjjB5BunJ/kq8=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=YagHrNkk9RhqYdv9JtmPBrpRhmrJItBxcfMzCp+C+Xii9kTK3UQyw4SIZV0vP5YrJ fzX06H+ru5EhB+2v0+tWWdD4+0rGHtZa37VIdh+uQPo+NS0d2JWO5JyNmqQ8zzwAeK fgNAZEkodp8FJfR/Xkwpl/11SYNumWb21jg2x7Xw= Date: Wed, 29 Mar 2023 07:46:22 +0200 From: Greg KH To: Luis Chamberlain Cc: david@redhat.com, patches@lists.linux.dev, linux-modules@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com, torvalds@linux-foundation.org, rafael@kernel.org, christophe.leroy@csgroup.eu, tglx@linutronix.de, peterz@infradead.org, song@kernel.org, rppt@kernel.org, willy@infradead.org, vbabka@suse.cz, mhocko@suse.com, dave.hansen@linux.intel.com Subject: Re: [PATCH 7/7] module: add debug stats to help identify memory pressure Message-ID: References: <20230329053149.3976378-1-mcgrof@kernel.org> <20230329053149.3976378-8-mcgrof@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20230329053149.3976378-8-mcgrof@kernel.org> X-Rspamd-Queue-Id: 6B12E40010 X-Stat-Signature: o8cfoasdf1k1go4jrfj9upwdmsoesdxp X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1680068787-323705 X-HE-Meta: U2FsdGVkX1/nPCCvwA+fi/IiuuTW6bISb0NQZS5eHVRT7KUG8GIk9skIxsLODPDNGQfCSjqB9UM7emFm8wvrA3ZLFORlyLss4sZj0hC2EADm1I32X8p9Q/pfBtyCivOv59xPJGN3Y+RAy1vZ42yE7m8hwrh83YCHMjtR5mWUmrLKojbQvokgCgJ3jZ9yWjckfAenzFsmtSrdtKLTeLlEWOPWZNgEowQuY0meksnJe0ylEaDp8olg5S/XRWUQ2JVIeMGnJHug5ILND1Sk/4Yk/O2gvuUv9gQMPuoO1TdV5lyRJmC5bCWd2PMPqMoFKA3h/t8fUQaoGfqu5ezqYRlVzUTmVaTuJwh4yPp3hHyO5K40lNe8zWXVJGMH0Wz6B70xGZQC8kWAycieEd5wGESmpCq3QfjtNS+HvhJlmlHhOlrL76/yHLHbaLgDdh0MW39PTskQ6h+/VRJ45k+w+yKOnS0VA1EB/n38SYtTzxO53b1MCtFZS8oV+qachn5oAc+Yg3s78HH6m3aObWSVIJqHdiG0021u2cfRxP9E8yjB2fekdDTYpQ+kVN2apyF9mpH3weg4AXdmukcdZC/r8YIwy+NNhsV5JNnbysNDqcifSOFpaWjCgNdpLUY6ITg+zCsFxL0tQP4KZd52ive128LztpWfeR6ghPo2ejbidX+hM4d37LUvpbkkoqAmsSKBNk38mSapA0L+jXMxGm6Q9nnY0SnGl/Jara6pTPHuQl7F0+3PPmurvLIf8jrWpQ4qYRT70cqiUshCSDUIeGLySOKgZrnhWc/qlQcCseYeWJzzLm1ixYRqgHWCZJAnIUCYsQXlABRPFDNZ789UxnY8KwtOX/ZVRODyyBN38NvkJrN7RbNKMA1YlWeOg4hYhYXDM5DDCiTcu0txOZJ0EoU0Sra6r+KIQCJ4NiJwQkfeC/v+GiVGOTcFiicZPrpUOBeBonuIwi1q+O6wQ6ygNRoywHS 6AC5MZ59 LwinnaKHpj1eqBg5lcv1K7mXkEoMTJtumvuUV5Meubwi4sGhQfy7qh1dLKohAIMD3/iCQii4KmYi0oan33MOjZwcNwxLoLF+uEnjwifO+Dk8JtuWWq0wNC1b2f7Km5O9iMehbiXXM9OmioxRd+NNUC5HOr+Gc3bETXbxDyVIECJqyrErHjUpDGmtSSQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Mar 28, 2023 at 10:31:49PM -0700, Luis Chamberlain wrote: > Loading modules with finit_module() can end up using vmalloc(), vmap() > and vmalloc() again, for a total of up to 3 separate allocations in the > worse case for a single module. We always kernel_read*() the module, > that's a vmalloc(). Then vmap() is used for the module decompression, > and if so the last read buffer is freed as we use the now decompressed > module buffer to stuff data into our copy module. The last one is > specific to architectures but pretty much that's generally a series > of vmalloc() for different ELF sections... > > Evaluation with new stress-ng module support [1] with just 100 ops > us proving that you can end up using GiBs of data easily even if > we are trying to be very careful not to load modules which are already > loaded. 100 ops seems to resemble the sort of pressure a system with > about 400 CPUs can create on modules. Although those issues for so > many concurrent loads per CPU is silly and are being fixed, we lack > proper tooling to help diagnose easily what happened, when it happened > and what likely are the culprits -- userspace or kernel module > autoloading. > > Provide an initial set of stats for debugfs which let us easily scrape > post-boot information about failed loads. This sort of information can > be used on production worklaods to try to optimize *avoiding* redundant > memory pressure using finit_module(). > > Screen shot: > > root@kmod ~ # cat /sys/kernel/debug/modules/stats > Modules loaded 67 Is this "loaded now", or "ever successfully loaded"? As in a modprobe/rmmod/modprobe would bump this by 2, right? Anyway, minor comment on the code, this looks overall great to me: > Total module size 11464704 > Total mod text size 4194304 > Failed kread bytes 890064 > Failed kmod bytes 890064 > Invalid kread bytes 890064 > Invalid decompress bytes 0 > Invalid mod bytes 890064 > Average mod size 171115 > Average mod text size 62602 > Failed modules: > kvm_intel > kvm > irqbypass > crct10dif_pclmul > ghash_clmulni_intel > sha512_ssse3 > sha512_generic > aesni_intel > crypto_simd > cryptd > evdev > serio_raw > virtio_pci > nvme > nvme_core > virtio_pci_legacy_dev > t10_pi > crc64_rocksoft > virtio_pci_modern_dev > crc32_pclmul > virtio > crc32c_intel > virtio_ring > crc64 > > [0] https://github.com/ColinIanKing/stress-ng.git > [1] echo 0 > /proc/sys/vm/oom_dump_tasks > ./stress-ng --module 100 --module-name xfs > > Signed-off-by: Luis Chamberlain > --- > kernel/module/Kconfig | 32 +++++++ > kernel/module/Makefile | 4 + > kernel/module/debug.c | 16 ++++ > kernel/module/internal.h | 35 +++++++ > kernel/module/main.c | 45 ++++++++- > kernel/module/stats.c | 200 +++++++++++++++++++++++++++++++++++++++ > kernel/module/tracking.c | 7 +- > 7 files changed, 331 insertions(+), 8 deletions(-) > create mode 100644 kernel/module/debug.c > create mode 100644 kernel/module/stats.c > > diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig > index 424b3bc58f3f..fbf7b92cb3d0 100644 > --- a/kernel/module/Kconfig > +++ b/kernel/module/Kconfig > @@ -22,6 +22,12 @@ menuconfig MODULES > > if MODULES > > +config MODULE_DEBUG > + bool "Module debugging" > + default n default is always "n", no need to set that specifically. > + help > + Allows to help debug module functionality. Might want to make this a bit more verbose. > config MODULE_FORCE_LOAD > bool "Forced module loading" > default n > @@ -48,6 +54,8 @@ config MODULE_FORCE_UNLOAD > rmmod). This is mainly for kernel developers and desperate users. > If unsure, say N. > > +if MODULE_DEBUG > + > config MODULE_UNLOAD_TAINT_TRACKING > bool "Tainted module unload tracking" > depends on MODULE_UNLOAD > @@ -59,6 +67,30 @@ config MODULE_UNLOAD_TAINT_TRACKING > page (see bad_page()), the aforementioned details are also > shown. If unsure, say N. > > +config MODULE_STATS > + bool "Module statistics" > + depends on DEBUG_FS > + default n Again, already default. > + help > + This option allows you to maintain a record of module statistics. > + For example each all modules size, average size, text size, and > + failed modules and the size for each of those. For failed > + modules we keep track of module which failed due to either the > + existing module taking too long to load or that module already > + was loaded. > + > + You should enable this if you are debugging production loads > + and want to see if userspace or the kernel is doing stupid things > + with loading modules when it shouldn't or if you want to help > + optimize userspace / kernel space module autoloading schemes. > + You might want to do this because failed modules tend to use > + use up significan amount of memory, and so you'd be doing everyone > + a favor in avoiding these failure proactively. > + > + If unsure, say N. > + > +endif # MODULE_DEBUG > + > config MODVERSIONS > bool "Module versioning support" > help > diff --git a/kernel/module/Makefile b/kernel/module/Makefile > index 5b1d26b53b8d..fe97047f3807 100644 > --- a/kernel/module/Makefile > +++ b/kernel/module/Makefile > @@ -20,4 +20,8 @@ obj-$(CONFIG_PROC_FS) += procfs.o > obj-$(CONFIG_SYSFS) += sysfs.o > obj-$(CONFIG_KGDB_KDB) += kdb.o > obj-$(CONFIG_MODVERSIONS) += version.o > + > +# Link order matters here, keep debug.o first. > +obj-$(CONFIG_MODULE_DEBUG) += debug.o > +obj-$(CONFIG_MODULE_STATS) += stats.o > obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o > diff --git a/kernel/module/debug.c b/kernel/module/debug.c > new file mode 100644 > index 000000000000..ef580d70b751 > --- /dev/null > +++ b/kernel/module/debug.c > @@ -0,0 +1,16 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) 2023 Luis Chamberlain > + */ > + > +#include > +#include "internal.h" > + > +struct dentry *mod_debugfs_root; > + > +static int module_debugfs_init(void) > +{ > + mod_debugfs_root = debugfs_create_dir("modules", NULL); > + return 0; > +} > +module_init(module_debugfs_init); Why is this a whole separate file? And as MODULE_DEBUG does not reference debugfs, yet it still creates the debugfs directory here? Yes, it will just not do anything if debugfs is not enabled, but then why is this file here? > diff --git a/kernel/module/internal.h b/kernel/module/internal.h > index 6ae29bb8836f..a645cb3fafc7 100644 > --- a/kernel/module/internal.h > +++ b/kernel/module/internal.h > @@ -143,6 +143,41 @@ static inline bool set_livepatch_module(struct module *mod) > #endif > } > > +#ifdef CONFIG_MODULE_STATS > + > +#define mod_stat_add64(count, var) atomic64_add(count, var) > +#define mod_stat_inc(name) atomic_inc(name) Ok, but: > + > +extern atomic64_t total_mod_size; > +extern atomic64_t total_text_size; > +extern atomic64_t invalid_kread_bytes; > +extern atomic64_t invalid_decompress_bytes; > +extern atomic64_t invalid_mod_becoming_bytes; > +extern atomic64_t invalid_mod_bytes; > + > +extern atomic_t modcount; > +extern atomic_t failed_kreads; > +extern atomic_t failed_decompress; > +extern atomic_t failed_load_modules; > +struct mod_fail_load { > + struct list_head list; > + char name[MODULE_NAME_LEN]; > + atomic_t count; > +}; > + > +int try_add_failed_module(const char *name); > + > +#else > + > +#define mod_stat_inc64(name) > +#define mod_stat_inc(name) atomic_inc(name) Why do you still increment the variable here if the option is not enabled? Also, didn't we have some sort of "we want to use an atomic variable as statistics" type somewhere in the kernel? Or did that never get accepted? And do all of these really need to be atomic variables? Don't you have locks for some of this to not need the atomic-ness of them? I guess it doesn't matter much as this isn't that fast of a code-path. thanks, greg k-h