From: Thomas Gleixner
To: Mike Rapoport
Cc: Song Liu, bpf@vger.kernel.org, linux-mm@kvack.org, peterz@infradead.org,
    akpm@linux-foundation.org, x86@kernel.org, hch@lst.de,
    rick.p.edgecombe@intel.com, aaron.lu@intel.com, mcgrof@kernel.org
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
References: <87v8mvsd8d.ffs@tglx>
Date: Thu, 01 Dec 2022 23:34:57 +0100
Message-ID: <87mt86rbvy.ffs@tglx>

Mike!

On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote:
> On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
>> The symptom is iTLB pressure. The root cause is the way how module
>> memory is allocated, which in turn causes the fragmentation into
>> 4k PTEs. That's the same problem for anything which uses module_alloc()
>> to get space for text allocated, e.g. kprobes, tracing....
>
> There's also dTLB pressure caused by the fragmentation of the direct map.
> The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
> but setting RO in the malloc address space also updates the direct map
> alias and this causes splits of large pages.
>
> It's not clear what causes more performance improvement: avoiding splits of
> large pages in the direct map or reducing iTLB pressure by backing text
> memory with 2M pages.

From our experiments when doing the first version of the SKX retbleed
mitigation, the main improvement came from reducing iTLB pressure simply
because the iTLB is really small.

The kernel text placement is way beyond suboptimal. If you really do a
hotpath analysis and (manually) place all hot code into one or two 2M
pages, then you can achieve massive performance improvements way above
the 10% range.

We currently have a master's student investigating this, but it will take
some time until usable results materialize.

> If the major improvement comes from keeping direct map intact, it
> might be possible to mix data and text in the same 2M page.

No. That can't work.

    text = RX
    data = RW or RO

If you mix this, then you end up with RWX for the whole 2M page. Not an
option really as you lose _all_ protections in one go.

That's why I said:

>> As a logical next step we make that three blocks and allocate text,
>> data and rodata separately, which will preserve the large mappings for
>> text and data. rodata still needs to be split because we need a space to
>> accommodate ro_after_init data.

The point is that rodata and ro_after_init data are a pretty small
portion of modules as far as my limited analysis of a distro build
shows. The bulk is in text and data. So if we preserve 2M pages for text
and for RW data and bite the bullet to split one 2M page for
ro[_after_init_]data, we get the maximum benefit for the least
complexity.

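
To make the permission argument concrete, here is a minimal userspace
sketch. It is not kernel code: the PERM_* bits and both helper names are
made up for illustration. It only models the fact that one large mapping
carries a single set of permissions, so it has to grant the union of
what everything placed inside it needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical permission bits for this sketch; not kernel definitions. */
#define PERM_R   0x1u
#define PERM_W   0x2u
#define PERM_X   0x4u

#define PERM_RX  (PERM_R | PERM_X)          /* text */
#define PERM_RW  (PERM_R | PERM_W)          /* data */
#define PERM_RO  (PERM_R)                   /* rodata */
#define PERM_RWX (PERM_R | PERM_W | PERM_X)

/*
 * One large mapping has exactly one set of permissions, so it must
 * grant the union of what every region placed inside it requires.
 */
static unsigned int large_page_perms(const unsigned int *needs, size_t n)
{
	unsigned int perms = 0;

	assert(needs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		perms |= needs[i];
	return perms;
}

/* Returns 1 if the mapping is writable and executable at the same time. */
static int violates_wx(unsigned int perms)
{
	return (perms & PERM_W) && (perms & PERM_X);
}
```

Feeding { PERM_RX, PERM_RW } into large_page_perms() yields PERM_RWX,
which violates_wx() flags. A page holding only text stays RX and a page
holding only data stays RW, which is the reason for separate blocks per
type.
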
>> But at the end we want an allocation mechanism which:
>>
>>   - preserves large mappings
>>   - handles a distinct address range
>>   - is mapping type aware
>>
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> There is also
>
>   - handles kaslr
>
> and at least for arm and powerpc we'd also need
>
>   - handles architecture specific range restrictions and fallbacks

Good points. kaslr should be fairly trivial.

The architecture specific restrictions and fallbacks are not really hard
to solve either. If done right then the allocator just falls back to 4k
maps during initialization in early boot which brings it back to the
status quo.

But we can provide consistent semantics for the three types which are
required for modules and the text only usage for kprobes, tracing,
bpf...

Thanks,

        tglx
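
For illustration only, the requirements listed above could be modelled
in userspace roughly like this. Everything here is an assumption made
for the sketch: the sketch_* names, the bump-pointer strategy and the
large_ok flag are hypothetical and are not the interface proposed in the
patch series. Each mapping type gets its own 2M aligned range, so text
and data never share a large page, rodata is always backed by 4k
mappings, and with large_ok == 0 every type falls back to 4k.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_4K 0x1000UL
#define SKETCH_2M 0x200000UL

/* The three mapping types discussed in the thread. */
enum sketch_type { SKETCH_TEXT, SKETCH_DATA, SKETCH_RODATA, SKETCH_NR_TYPES };

struct sketch_pool {
	unsigned long start;    /* first address of this type's range */
	unsigned long end;      /* one past the last address */
	unsigned long next;     /* next free address (bump pointer) */
	unsigned long map_size; /* mapping granularity backing the range */
};

struct sketch_allocator {
	struct sketch_pool pool[SKETCH_NR_TYPES];
};

/*
 * Carve one distinct, 2M aligned range per type out of [base, ...).
 * base is assumed to be nonzero so that 0 can signal allocation failure.
 * large_ok == 0 models the fallback to 4k mappings.
 */
static void sketch_init(struct sketch_allocator *a, unsigned long base,
			unsigned long range_size, int large_ok)
{
	assert(base % SKETCH_2M == 0 && range_size % SKETCH_2M == 0);

	for (int t = 0; t < SKETCH_NR_TYPES; t++) {
		struct sketch_pool *p = &a->pool[t];

		p->start = base + (unsigned long)t * range_size;
		p->end = p->start + range_size;
		p->next = p->start;
		/* rodata is split anyway to make room for ro_after_init data */
		p->map_size = (large_ok && t != SKETCH_RODATA) ?
			      SKETCH_2M : SKETCH_4K;
	}
}

/* Returns the allocated address, or 0 if the type's range is exhausted. */
static unsigned long sketch_alloc(struct sketch_allocator *a,
				  enum sketch_type type, unsigned long size)
{
	struct sketch_pool *p = &a->pool[type];
	unsigned long addr = p->next;

	/* Round up to whole 4k pages */
	size = (size + SKETCH_4K - 1) & ~(SKETCH_4K - 1);
	if (size == 0 || size > p->end - addr)
		return 0;
	p->next = addr + size;
	return addr;
}
```

A caller would run sketch_init() once with the base of the reserved
range (which is where a kaslr offset would be applied) and then request
memory per type, e.g. sketch_alloc(&a, SKETCH_TEXT, len). Freeing and
reuse are left out to keep the sketch short.
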