Re: [PATCH bpf-next v2 0/5] execmem

linuxppc-dev.lists.ozlabs.org archive mirror

* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
       [not found]       ` <CAPhsuW5e8rBnu73DYkyc1L6gC-WBxjTZVwdFC_L12GVyzROR1w@mail.gmail.com>
@ 2022-11-09 21:23         ` Christophe Leroy
  2022-11-10  1:50           ` Song Liu
  0 siblings, 1 reply; 2+ messages in thread
From: Christophe Leroy @ 2022-11-09 21:23 UTC
  To: Song Liu, Mike Rapoport
  Cc: Lu, Aaron, akpm@linux-foundation.org, peterz@infradead.org,
	x86@kernel.org, linux-mm@kvack.org, mcgrof@kernel.org,
	bpf@vger.kernel.org, Edgecombe, Rick P,
	linuxppc-dev@lists.ozlabs.org, hch@lst.de

+ linuxppc-dev list as we start mentioning powerpc.

On 09/11/2022 at 18:43, Song Liu wrote:
> On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
>>
> [...]
> 
>>>>
>>>> The proposed execmem_alloc() looks to me very much tailored for x86
>>>> to be
>>>> used as a replacement for module_alloc(). Some architectures have
>>>> module_alloc() that is quite different from the default or x86
>>>> version, so
>>>> I'd expect at least some explanation how modules etc can use execmem_
>>>> APIs
>>>> without breaking !x86 architectures.
>>>
>>> I think this is fair, but I think we should ask ourselves - how
>>> much should we do in one step?
>>
>> I think that at least we need evidence that execmem_alloc() etc can be
>> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
>> for him at all, so having a core MM API for code allocation that only works
>> with BPF on x86 seems not right to me.
> 
> While using execmem_alloc() et al. in module support is difficult, folks are
> making progress with it. For example, the prototype would be more difficult
> before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> (introduced by Christophe).

By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
was completely different: on powerpc book3s/32, no-exec flagging is per
segment of 256 Mbytes, so in order to provide STRICT_MODULES_RWX it was
necessary to put data outside of the segment that holds module text, so
that RW data can be flagged as no-exec.

But I'm happy if it can also serve other purposes.
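
To make the book3s/32 constraint concrete, here is a minimal userspace
sketch in C. It only models the arithmetic: a 32-bit address space cut
into 256 Mbyte segments, where no-exec can be applied to data only if the
data shares no segment with text. The helper names and the example
addresses are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 256 Mbytes = 1 << 28, so the top 4 address bits select the segment. */
#define SEGMENT_SHIFT 28u

/* Hypothetical helper: index (0..15) of the segment holding addr. */
static inline uint32_t segment_index(uint32_t addr)
{
	return addr >> SEGMENT_SHIFT;
}

/*
 * Hypothetical helper: true if [data_start, data_end) shares no segment
 * with [text_start, text_end). Only then can the data be flagged no-exec
 * when the no-exec attribute is per segment. End addresses are exclusive.
 */
static bool data_can_be_noexec(uint32_t text_start, uint32_t text_end,
			       uint32_t data_start, uint32_t data_end)
{
	assert(text_start < text_end && data_start < data_end);

	uint32_t text_first = segment_index(text_start);
	uint32_t text_last = segment_index(text_end - 1);
	uint32_t data_first = segment_index(data_start);
	uint32_t data_last = segment_index(data_end - 1);

	/* The two segment ranges must not intersect. */
	return data_last < text_first || data_first > text_last;
}
```

With text and data in the same 256 Mbyte segment the check fails, which
is the situation that moving module data to vmalloc space avoids.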

> 
> We also have other users that we can onboard soon: BPF trampoline on
> x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> s390.
> 
>>
>>> For non-text_poke() architectures, the way you can make it work is have
>>> the API look like:
>>> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
>>> execmem_write()  <- Loads the mapping, doesn't work after finish()
>>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
>>>
>>> So for text_poke():
>>> execmem_alloc()  <- reserves the mapping
>>> execmem_write()  <- text_pokes() to the mapping
>>> execmem_finish() <- does nothing
>>>
>>> And non-text_poke():
>>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
>>> execmem_write()  <- Writes normally to it
>>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
>>>
>>> Non-text_poke() only gets the benefits of centralized logic, but the
>>> interface works for both. This is pretty much what the perm_alloc() RFC
>>> did to make it work with other arch's and modules. But to fit with the
>>> existing modules code (which is actually spread all over) and also
>>> handle RO sections, it also needed some additional bells and whistles.
>>
>> I'm less concerned about non-text_poke() part, but rather about
>> restrictions where code and data can live on different architectures and
>> whether these restrictions won't lead to inability to use the centralized
>> logic on, say, arm64 and powerpc.

Until recently, powerpc CPUs didn't implement PC-relative data access.
Only very recent powerpc CPUs (power10 only?) have the capability to do
PC-relative accesses, but the kernel doesn't use it yet. So there's no
constraint on the distance between text and data. What matters is the
distance between core kernel text and module text, to avoid trampolines.
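
As a rough illustration of the distance constraint, here is a small C
sketch. It assumes the limiting factor is the powerpc relative branch
(b/bl), whose 24-bit word offset gives a signed reach of +/- 32 Mbytes.
The function name is hypothetical and this is not how the kernel's module
loader is written; it only shows the range test that decides whether a
direct branch is possible or a trampoline is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 24-bit field, shifted left by 2: signed byte offset in [-2^25, 2^25). */
#define REL24_RANGE ((int64_t)1 << 25)

/*
 * Hypothetical helper: true if a relative branch located at 'from' can
 * reach 'to' directly. Instructions are 4-byte aligned, so the offset
 * must be a multiple of 4 as well.
 */
static bool rel24_reachable(uint64_t from, uint64_t to)
{
	assert((from & 3) == 0);

	int64_t off = (int64_t)(to - from);

	return off >= -REL24_RANGE && off < REL24_RANGE && (off & 3) == 0;
}
```

If module text lands further than that from the code it calls, the call
has to go through a trampoline, which is why the placement of module text
relative to core kernel text matters more than the placement of data.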

>>
>> For instance, if we use execmem_alloc() for modules, it means that data
>> sections should be allocated separately with plain vmalloc(). Will this
>> work universally? Or will this require special care with additional
>> complexity in the modules code?
>>
>>> So the question I'm trying to ask is, how much should we target for the
>>> next step? I first thought that this functionality was so intertwined,
>>> it would be too hard to do iteratively. So if we want to try
>>> iteratively, I'm ok if it doesn't solve everything.
>>
>> With execmem_alloc() as the first step I'm failing to see the large
>> picture. If we want to use it for modules, how will we allocate RO data?
>> with similar rodata_alloc() that uses yet another tree in vmalloc?
>> How can the caching of large pages in vmalloc be made useful for use cases
>> like secretmem and PKS?
> 
> If RO data causes problems with direct map fragmentation, we can use
> similar logic. I think we will need another tree in vmalloc for this case.
> Since the logic will be mostly identical, I personally don't think adding
> another tree is a big overhead.

On powerpc, kernel core RAM is not mapped by pages but by blocks. There
are only two blocks: one ROX block which contains both text and rodata,
and one RW block that contains everything else. Maybe the same can be
done for modules. What matters is to be sure you never have WX memory.
Having ROX rodata is not an issue.

Christophe


* Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
  2022-11-09 21:23         ` [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs Christophe Leroy
@ 2022-11-10  1:50           ` Song Liu
  0 siblings, 0 replies; 2+ messages in thread
From: Song Liu @ 2022-11-10  1:50 UTC
  To: Christophe Leroy
  Cc: Lu, Aaron, akpm@linux-foundation.org, peterz@infradead.org,
	x86@kernel.org, linux-mm@kvack.org, mcgrof@kernel.org,
	bpf@vger.kernel.org, Edgecombe, Rick P,
	linuxppc-dev@lists.ozlabs.org, hch@lst.de, Mike Rapoport

On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> + linuxppc-dev list as we start mentioning powerpc.
>
> On 09/11/2022 at 18:43, Song Liu wrote:
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
> >>
> > [...]
> >
> >>>>
> >>>> The proposed execmem_alloc() looks to me very much tailored for x86
> >>>> to be
> >>>> used as a replacement for module_alloc(). Some architectures have
> >>>> module_alloc() that is quite different from the default or x86
> >>>> version, so
> >>>> I'd expect at least some explanation how modules etc can use execmem_
> >>>> APIs
> >>>> without breaking !x86 architectures.
> >>>
> >>> I think this is fair, but I think we should ask ourselves - how
> >>> much should we do in one step?
> >>
> >> I think that at least we need evidence that execmem_alloc() etc can be
> >> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> >> for him at all, so having a core MM API for code allocation that only works
> >> with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et al. in module support is difficult, folks are
> > making progress with it. For example, the prototype would be more difficult
> > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
>
> By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> was completely different: on powerpc book3s/32, no-exec flagging is per
> segment of 256 Mbytes, so in order to provide STRICT_MODULES_RWX it was
> necessary to put data outside of the segment that holds module text, so
> that RW data can be flagged as no-exec.

Yeah, I only noticed the actual motivation of this work earlier today. :)

>
> But I'm happy if it can also serve other purposes.
>
> >
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> > s390.
> >
> >>
> >>> For non-text_poke() architectures, the way you can make it work is have
> >>> the API look like:
> >>> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> >>> execmem_write()  <- Loads the mapping, doesn't work after finish()
> >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >>>
> >>> So for text_poke():
> >>> execmem_alloc()  <- reserves the mapping
> >>> execmem_write()  <- text_pokes() to the mapping
> >>> execmem_finish() <- does nothing
> >>>
> >>> And non-text_poke():
> >>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> >>> execmem_write()  <- Writes normally to it
> >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >>>
> >>> Non-text_poke() only gets the benefits of centralized logic, but the
> >>> interface works for both. This is pretty much what the perm_alloc() RFC
> >>> did to make it work with other arch's and modules. But to fit with the
> >>> existing modules code (which is actually spread all over) and also
> >>> handle RO sections, it also needed some additional bells and whistles.
> >>
> >> I'm less concerned about non-text_poke() part, but rather about
> >> restrictions where code and data can live on different architectures and
> >> whether these restrictions won't lead to inability to use the centralized
> >> logic on, say, arm64 and powerpc.
>
> Until recently, powerpc CPUs didn't implement PC-relative data access.
> Only very recent powerpc CPUs (power10 only?) have the capability to do
> PC-relative accesses, but the kernel doesn't use it yet. So there's no
> constraint on the distance between text and data. What matters is the
> distance between core kernel text and module text, to avoid trampolines.

Ah, this is great. I guess this means powerpc can benefit from this work
with much less effort than x86_64.

>
> >>
> >> For instance, if we use execmem_alloc() for modules, it means that data
> >> sections should be allocated separately with plain vmalloc(). Will this
> >> work universally? Or will this require special care with additional
> >> complexity in the modules code?
> >>
> >>> So the question I'm trying to ask is, how much should we target for the
> >>> next step? I first thought that this functionality was so intertwined,
> >>> it would be too hard to do iteratively. So if we want to try
> >>> iteratively, I'm ok if it doesn't solve everything.
> >>
> >> With execmem_alloc() as the first step I'm failing to see the large
> >> picture. If we want to use it for modules, how will we allocate RO data?
> >> with similar rodata_alloc() that uses yet another tree in vmalloc?
> >> How can the caching of large pages in vmalloc be made useful for use cases
> >> like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> On powerpc, kernel core RAM is not mapped by pages but by blocks. There
> are only two blocks: one ROX block which contains both text and rodata,
> and one RW block that contains everything else. Maybe the same can be
> done for modules. What matters is to be sure you never have WX memory.
> Having ROX rodata is not an issue.

Got it. Thanks!

Song
