linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Rapoport <rppt@kernel.org>
To: Andy Lutomirski <luto@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Kees Cook <keescook@chromium.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	"David S. Miller" <davem@davemloft.net>,
	Dinh Nguyen <dinguyen@kernel.org>,
	Heiko Carstens <hca@linux.ibm.com>, Helge Deller <deller@gmx.de>,
	Huacai Chen <chenhuacai@kernel.org>,
	Kent Overstreet <kent.overstreet@linux.dev>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Nadav Amit <nadav.amit@gmail.com>,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	Puranjay Mohan <puranjay12@gmail.com>,
	Rick P Edgecombe <rick.p.edgecombe@intel.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	Song Liu <song@kernel.org>, Steven Rostedt <rostedt@goodmis.org>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Will Deacon <will@kernel.org>,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@vger.kernel.org, linux-mm@kvack.org,
	linux-modules@vger.kernel.org, linux-parisc@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
	netdev@vger.kernel.org, sparclinux@vger.kernel.org,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Date: Sun, 25 Jun 2023 20:42:57 +0300	[thread overview]
Message-ID: <20230625174257.GL52412@kernel.org> (raw)
In-Reply-To: <90161ac9-3ca0-4c72-b1c4-ab1293e55445@app.fastmail.com>

On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> 
> 
> On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> >> 
> >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> >> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> >> >
> >> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >> >
> >> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> >> > and puts the burden of code allocation to the modules code.
> >> >> >
> >> >> > Several architectures override module_alloc() because of various
> >> >> > constraints where the executable memory can be located and this causes
> >> >> > additional obstacles for improvements of code allocation.
> >> >> >
> >> >> > Start splitting code allocation from modules by introducing
> >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >> >
> >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >> >
> >> >> > The intention semantics for new allocation APIs:
> >> >> >
> >> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >> >   code that is restricted by relative addressing.
> >> >> >
> >> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >> >   when there are no restrictions for the code placement. For
> >> >> >   architectures that require that any code is within certain distance
> >> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >> >   execmem_text_alloc().
> >> >> >
> >> >> 
> >> >> Is there anything in this series to help users do the appropriate
> >> >> synchronization when the actually populate the allocated memory with
> >> >> code?  See here, for example:
> >> >
> >> > This series only factors out the executable allocations from modules and
> >> > puts them in a central place.
> >> > Anything else would go on top after this lands.
> >> 
> >> Hmm.
> >> 
> >> On the one hand, there's nothing wrong with factoring out common code. On
> >> the other hand, this is probably the right time to at least start
> >> thinking about synchronization, at least to the extent that it might make
> >> us want to change this API.  (I'm not at all saying that this series
> >> should require changes -- I'm just saying that this is a good time to
> >> think about how this should work.)
> >> 
> >> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> >> look like the one think in the Linux ecosystem that actually
> >> intelligently and efficiently maps new text into an address space:
> >> mmap().
> >> 
> >> On x86, you can mmap() an existing file full of executable code PROT_EXEC
> >> and jump to it with minimal synchronization (just the standard implicit
> >> ordering in the kernel that populates the pages before setting up the
> >> PTEs and whatever user synchronization is needed to avoid jumping into
> >> the mapping before mmap() finishes).  It works across CPUs, and the only
> >> possible way userspace can screw it up (for a read-only mapping of
> >> read-only text, anyway) is to jump to the mapping too early, in which
> >> case userspace gets a page fault.  Incoherence is impossible, and no one
> >> needs to "serialize" (in the SDM sense).
> >> 
> >> I think the same sequence (from userspace's perspective) works on other
> >> architectures, too, although I think more cache management is needed on
> >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> >> IPI to map executable text into usermode, but I could easily be wrong.
> >> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> >> remember the details.)
> >> 
> >> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> >> rather fraught, and I bet many things do it wrong when userspace is
> >> multithreaded.  But not in production because it's mostly not used in
> >> production.)
> >> 
> >> But jit_text_alloc() can't do this, because the order of operations
> >> doesn't match.  With jit_text_alloc(), the executable mapping shows up
> >> before the text is populated, so there is no atomic change from not-there
> >> to populated-and-executable.  Which means that there is an opportunity
> >> for CPUs, speculatively or otherwise, to start filling various caches
> >> with intermediate states of the text, which means that various
> >> architectures (even x86!) may need serialization.
> >> 
> >> For eBPF- and module- like use cases, where JITting/code gen is quite
> >> coarse-grained, perhaps something vaguely like:
> >> 
> >> jit_text_alloc() -> returns a handle and an executable virtual address,
> >> but does *not* map it there
> >> jit_text_write() -> write to that handle
> >> jit_text_map() -> map it and synchronize if needed (no sync needed on
> >> x86, I think)
> >> 
> >> could be more efficient and/or safer.
> >> 
> >> (Modules could use this too.  Getting alternatives right might take some
> >> fiddling, because off the top of my head, this doesn't match how it works
> >> now.)
> >> 
> >> To make alternatives easier, this could work, maybe (haven't fully
> >> thought it through):
> >> 
> >> jit_text_alloc()
> >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
> >> 
> >> write the text and apply alternatives
> >> 
> >> jit_text_finalize() -> change from RW to RX *and synchronize*
> >> 
> >> jit_text_finalize() would either need to wait for RCU (possibly extra
> >> heavy weight RCU to get "serialization") or send an IPI.
> >
> > This essentially how modules work now. The memory is allocated RW, written
> > and updated with alternatives and then made ROX in the end with set_memory
> > APIs.
> >
> > The issue with not having the memory mapped X when it's written is that we
> > cannot use large pages to map it. One of the goals is to have executable
> > memory mapped with large pages and make code allocator able to divide that
> > page among several callers.
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
> 
> Is this actually valid?  In between int3 and real code, there’s a
> potential torn read of real code mixed up with 0xcc.
 
You mean while doing text poking?

> > To improve the performance of this process, we can write to !X copy and
> > then text_poke it to the actual address in one go. This will require some
> > changes to get the alternatives right.
> >
> > -- 
> > Sincerely yours,
> > Mike.

-- 
Sincerely yours,
Mike.

  reply	other threads:[~2023-06-25 17:44 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
2023-06-16 16:00   ` Edgecombe, Rick P
2023-06-17  5:52     ` Mike Rapoport
2023-06-16 18:14   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
2023-06-16 16:48   ` Kent Overstreet
2023-06-16 18:18     ` Song Liu
2023-06-17  5:57     ` Mike Rapoport
2023-06-17 20:38   ` Andy Lutomirski
2023-06-18  8:00     ` Mike Rapoport
2023-06-19 17:09       ` Andy Lutomirski
2023-06-19 20:18         ` Nadav Amit
2023-06-20 17:24           ` Andy Lutomirski
2023-06-25 16:14         ` Mike Rapoport
2023-06-25 16:59           ` Andy Lutomirski
2023-06-25 17:42             ` Mike Rapoport [this message]
2023-06-25 18:07               ` Kent Overstreet
2023-06-26  6:13                 ` Song Liu
2023-06-26  9:54                   ` Puranjay Mohan
2023-06-26 12:23                     ` Mark Rutland
2023-06-26 12:31           ` Mark Rutland
2023-06-26 17:48             ` Song Liu
2023-07-17 17:23               ` Andy Lutomirski
2023-06-26 13:01         ` Mark Rutland
2023-06-19 11:34     ` Kent Overstreet
2023-06-16  8:50 ` [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem Mike Rapoport
2023-06-16 18:36   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
2023-06-16 16:16   ` Edgecombe, Rick P
2023-06-17  6:10     ` Mike Rapoport
2023-06-16 18:53   ` Song Liu
2023-06-17  6:14     ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 05/12] modules, execmem: drop module_alloc Mike Rapoport
2023-06-16 18:56   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
2023-06-16 16:55   ` Edgecombe, Rick P
2023-06-17  6:44     ` Mike Rapoport
2023-06-16 20:01   ` Song Liu
2023-06-17  6:51     ` Mike Rapoport
2023-06-18 22:32   ` Thomas Gleixner
2023-06-18 23:14     ` Kent Overstreet
2023-06-19  0:43       ` Thomas Gleixner
2023-06-19  2:12         ` Kent Overstreet
2023-06-20 14:51         ` Steven Rostedt
2023-06-20 15:32           ` Alexei Starovoitov
2023-06-19 15:23     ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions Mike Rapoport
2023-06-16 20:05   ` Song Liu
2023-06-17  6:57     ` Mike Rapoport
2023-06-17 15:36       ` Kent Overstreet
2023-06-17 16:38         ` Song Liu
2023-06-17 20:37           ` Kent Overstreet
2023-06-16  8:50 ` [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations Mike Rapoport
2023-06-16 20:09   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 09/12] powerpc: " Mike Rapoport
2023-06-16 20:09   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES Mike Rapoport
2023-06-16 20:17   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
2023-06-16 20:18   ` Song Liu
2023-06-16  8:50 ` [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
2023-06-16 11:44   ` Björn Töpel
2023-06-17  6:52     ` Mike Rapoport
2023-06-16 17:02 ` [PATCH v2 00/12] mm: jit/text allocator Edgecombe, Rick P

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230625174257.GL52412@kernel.org \
    --to=rppt@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bpf@vger.kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=chenhuacai@kernel.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=davem@davemloft.net \
    --cc=deller@gmx.de \
    --cc=dinguyen@kernel.org \
    --cc=hca@linux.ibm.com \
    --cc=keescook@chromium.org \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-modules@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=loongarch@lists.linux.dev \
    --cc=luto@kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mcgrof@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=nadav.amit@gmail.com \
    --cc=naveen.n.rao@linux.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=palmer@dabbelt.com \
    --cc=puranjay12@gmail.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tsbogend@alpha.franken.de \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).