From: Song Liu
Date: Wed, 9 Nov 2022 17:50:39 -0800
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
To: Christophe Leroy
Cc: Mike Rapoport, "Edgecombe, Rick P", "peterz@infradead.org", "bpf@vger.kernel.org", "linux-mm@kvack.org", "hch@lst.de", "x86@kernel.org", "akpm@linux-foundation.org", "mcgrof@kernel.org", "Lu, Aaron", "linuxppc-dev@lists.ozlabs.org"
References: <20221107223921.3451913-1-song@kernel.org> <9e59a4e8b6f071cf380b9843cdf1e9160f798255.camel@intel.com>

On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy wrote:
>
> + linuxppc-dev list as we start mentioning powerpc.
>
> On 09/11/2022 at 18:43, Song Liu wrote:
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport wrote:
> >>
> > [...]
> >
> >>>>
> >>>> The proposed execmem_alloc() looks to me very much tailored for x86
> >>>> to be used as a replacement for module_alloc(). Some architectures
> >>>> have module_alloc() that is quite different from the default or x86
> >>>> version, so I'd expect at least some explanation how modules etc can
> >>>> use execmem_ APIs without breaking !x86 architectures.
> >>>
> >>> I think this is fair, but I think we should ask ourselves - how
> >>> much should we do in one step?
> >>
> >> I think that at least we need evidence that execmem_alloc() etc can be
> >> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> >> for him at all, so having a core MM API for code allocation that only works
> >> with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et al. in module support is difficult, folks are
> > making progress with it. For example, the prototype would have been more
> > difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
>
> By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> was completely different: This was because on powerpc book3s/32, no-exec
> flagging is per segment of size 256 Mbytes, so in order to provide
> STRICT_MODULES_RWX it was necessary to put data outside of the segment
> that holds module text in order to be able to flag RW data as no-exec.

Yeah, I only noticed the actual motivation of this work earlier today. :)

>
> But I'm happy if it can also serve other purposes.
>
> >
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF JIT and trampoline on arm64, and maybe also on powerpc and
> > s390.
> >
> >>
> >>> For non-text_poke() architectures, the way you can make it work is to have
> >>> the API look like:
> >>>   execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> >>>   execmem_write()  <- Loads the mapping, doesn't work after finish()
> >>>   execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >>>
> >>> So for text_poke():
> >>>   execmem_alloc()  <- reserves the mapping
> >>>   execmem_write()  <- text_pokes() to the mapping
> >>>   execmem_finish() <- does nothing
> >>>
> >>> And non-text_poke():
> >>>   execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> >>>   execmem_write()  <- Writes normally to it
> >>>   execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >>>
> >>> Non-text_poke() only gets the benefits of centralized logic, but the
> >>> interface works for both. This is pretty much what the perm_alloc() RFC
> >>> did to make it work with other arch's and modules. But to fit with the
> >>> existing modules code (which is actually spread all over) and also
> >>> handle RO sections, it also needed some additional bells and whistles.
> >>
> >> I'm less concerned about non-text_poke() part, but rather about
> >> restrictions where code and data can live on different architectures and
> >> whether these restrictions won't lead to inability to use the centralized
> >> logic on, say, arm64 and powerpc.
>
> Until recently, powerpc CPUs didn't implement PC-relative data access.
> Only very recent powerpc CPUs (power10 only ?) have the capability to do
> PC-relative accesses, but the kernel doesn't use it yet. So there's no
> constraint about distance between text and data. What matters is the
> distance between core kernel text and module text to avoid trampolines.

Ah, this is great. I guess this means powerpc can benefit from this work
with much less effort than x86_64.
>
> >>
> >> For instance, if we use execmem_alloc() for modules, it means that data
> >> sections should be allocated separately with plain vmalloc(). Will this
> >> work universally? Or will this require special care with additional
> >> complexity in the modules code?
> >>
> >>> So the question I'm trying to ask is, how much should we target for the
> >>> next step? I first thought that this functionality was so intertwined,
> >>> it would be too hard to do iteratively. So if we want to try
> >>> iteratively, I'm ok if it doesn't solve everything.
> >>
> >> With execmem_alloc() as the first step I'm failing to see the large
> >> picture. If we want to use it for modules, how will we allocate RO data?
> >> with similar rodata_alloc() that uses yet another tree in vmalloc?
> >> How can the caching of large pages in vmalloc be made useful for use cases
> >> like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> On powerpc, kernel core RAM is not mapped by pages but is mapped by
> blocks. There are only two blocks: One ROX block which contains both
> text and rodata, and one RW block that contains everything else. Maybe
> the same can be done for modules. What matters is to be sure you never
> have WX memory. Having ROX rodata is not an issue.

Got it. Thanks!

Song
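
The alloc/write/finish flow quoted in this thread for the non-text_poke() case, together with the "never have WX memory" rule, can be approximated in userspace. This is only an analogy and not kernel code: it assumes a Linux system where anonymous mmap() memory may later be made executable with mprotect(), and the sketch_* functions and struct sketch_region are hypothetical names invented for illustration. They are not part of the patch set or of any kernel API.

```c
/* Userspace analogy only. The sketch_* helpers are hypothetical stand-ins,
 * not the execmem_* kernel API discussed in this thread. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

struct sketch_region {
	void *addr;    /* start of the mapping */
	size_t size;   /* length of the mapping in bytes */
	int finished;  /* non-zero once the region has been switched to ROX */
};

/* alloc: reserve a plain RW mapping; it is not executable yet. */
int sketch_execmem_alloc(struct sketch_region *r, size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -errno;
	r->addr = p;
	r->size = size;
	r->finished = 0;
	return 0;
}

/* write: copy bytes into the region; refused once finish() has run. */
int sketch_execmem_write(struct sketch_region *r, size_t off,
			 const void *src, size_t len)
{
	if (r->finished)
		return -EPERM;
	if (off > r->size || len > r->size - off)
		return -EINVAL;
	memcpy((char *)r->addr + off, src, len);
	return 0;
}

/* finish: drop W and add X in one mprotect() call, so the region is
 * never writable and executable at the same time. */
int sketch_execmem_finish(struct sketch_region *r)
{
	if (mprotect(r->addr, r->size, PROT_READ | PROT_EXEC) != 0)
		return -errno;
	r->finished = 1;
	return 0;
}

/* free: unmap the region and clear the pointer. */
void sketch_execmem_free(struct sketch_region *r)
{
	assert(r != NULL);
	munmap(r->addr, r->size);
	r->addr = NULL;
}
```

A caller would run sketch_execmem_alloc(), any number of sketch_execmem_write() calls, then sketch_execmem_finish(); a write attempted after finish returns -EPERM. On a text_poke() architecture the same three entry points would behave differently, as described in the quoted text: the write step would patch an already-ROX mapping and finish would do nothing.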