From: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
To: ast@fb.com, Daniel Borkmann <daniel@iogearbox.net>,
Michael Ellerman <mpe@ellerman.id.au>,
Sandipan Das <sandipan@linux.vnet.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>,
linuxppc-dev@lists.ozlabs.org, netdev@vger.kernel.org
Subject: Re: [RFC][PATCH bpf v2 1/2] bpf: allow 64-bit offsets for bpf function calls
Date: Wed, 21 Feb 2018 00:52:27 +0530 [thread overview]
Message-ID: <1519153431.im5wioxel1.naveen@linux.ibm.com> (raw)
In-Reply-To: <87po50roup.fsf@concordia.ellerman.id.au>
Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>> Daniel Borkmann wrote:
>>> On 02/15/2018 05:25 PM, Daniel Borkmann wrote:
>>>> On 02/13/2018 05:05 AM, Sandipan Das wrote:
>>>>> The imm field of a bpf_insn is a signed 32-bit integer. For
>>>>> JIT-ed bpf-to-bpf function calls, it stores the offset from
>>>>> __bpf_call_base to the start of the callee function.
>>>>>
>>>>> For some architectures, such as powerpc64, it was found that
>>>>> this offset may be as large as 64 bits because of which this
>>>>> cannot be accomodated in the imm field without truncation.
>>>>>
>>>>> To resolve this, we additionally make aux->func within each
>>>>> bpf_prog associated with the functions to point to the list
>>>>> of all function addresses determined by the verifier.
>>>>>
>>>>> We keep the value assigned to the off field of the bpf_insn
>>>>> as a way to index into aux->func and also set aux->func_cnt
>>>>> so that this can be used for performing basic upper bound
>>>>> checks for the off field.
>>>>>
>>>>> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
>>>>> ---
>>>>> v2: Make aux->func point to the list of functions determined
>>>>> by the verifier rather than allocating a separate callee
>>>>> list for each function.
>>>>=20
>>>> Approach looks good to me; do you know whether s390x JIT would
>>>> have similar requirement? I think one limitation that would still
>>>> need to be addressed later with such approach would be regarding the
>>>> xlated prog dump in bpftool, see 'BPF calls via JIT' in 7105e828c087
>>>> ("bpf: allow for correlation of maps and helpers in dump"). Any
>>>> ideas for this (potentially if we could use off + imm for calls,
>>>> we'd get to 48 bits, but that seems still not be enough as you say)?
>>
>> All good points. I'm not really sure how s390x works, so I can't comment=
=20
>> on that, but I'm copying Michael Holzheu for his consideration.
>>
>> With the existing scheme, 48 bits won't be enough, so we rejected that=20
>> approach. I can also see how this will be a problem with bpftool, but I=20
>> haven't looked into it in detail. I wonder if we can annotate the output=
=20
>> to indicate the function being referred to?
>>
>>>=20
>>> One other random thought, although I'm not sure how feasible this
>>> is for ppc64 JIT to realize ... but idea would be to have something
>>> like the below:
>>>=20
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 29ca920..daa7258 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -512,6 +512,11 @@ int bpf_get_kallsym(unsigned int symnum, unsigned =
long *value, char *type,
>>> return ret;
>>> }
>>>=20
>>> +void * __weak bpf_jit_image_alloc(unsigned long size)
>>> +{
>>> + return module_alloc(size);
>>> +}
>>> +
>>> struct bpf_binary_header *
>>> bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
>>> unsigned int alignment,
>>> @@ -525,7 +530,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **ima=
ge_ptr,
>>> * random section of illegal instructions.
>>> */
>>> size =3D round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
>>> - hdr =3D module_alloc(size);
>>> + hdr =3D bpf_jit_image_alloc(size);
>>> if (hdr =3D=3D NULL)
>>> return NULL;
>>>=20
>>> And ppc64 JIT could override bpf_jit_image_alloc() in a similar way
>>> like some archs would override the module_alloc() helper through a
>>> custom implementation, usually via __vmalloc_node_range(), so we
>>> could perhaps fit the range for BPF JITed images in a way that they
>>> could use the 32bit imm in the end? There are not that many progs
>>> loaded typically, so the range could be a bit narrower in such case
>>> anyway. (Not sure if this would work out though, but I thought to
>>> bring it up.)
>>
>> That'd be a good option to consider. I don't think we want to allocate=20
>> anything from the linear memory range since users could load=20
>> unprivileged BPF programs and consume a lot of memory that way. I doubt=20
>> if we can map vmalloc'ed memory into the 0xc0 address range, but I'm not=
=20
>> entirely sure.
>>
>> Michael,
>> Is the above possible? The question is if we can have BPF programs be=20
>> allocated within 4GB of __bpf_call_base (which is a kernel symbol), so=20
>> that calls to those programs can be encoded in a 32-bit immediate field=20
>> in a BPF instruction.
>=20
> Hmmm.
>=20
> It's not technically impossible, but I don't think it's really a good
> option.
>=20
> The 0xc range is a linear mapping of RAM, and the kernel tends to be
> near the start of RAM for reasons. That means there generally isn't a
> hole in the 0xc range within 4GB for you to map BPF programs.
>=20
> You could create a hole by making the 0xc mapping non linear, ie.
> mapping some RAM near the kernel elsewhere in the 0xc range, to make a
> hole that you can then remap BPF programs into. But I think that would
> cause a lot of bugs, it's a pretty fundamental assumption that the
> linear mapping is 1:1.
>=20
>> As an extension, we may be able to extend it to=20
>> 48-bits by combining with another BPF instruction field (offset). In=20
>> either case, the vmalloc'ed address range won't work.
>=20
> 48-bits could possibly work, we don't have systems with that much RAM
> *yet*. So you could remap the BPF programs at the end of the 0xc range,
> or somewhere we have a gap in RAM.
>=20
> That would probably still confuse some things, ie. the 0xc mapping would
> be a 1:1 mapping of RAM plus some other stuff, but it might be tractable
> to fix.
>=20
> We don't have page tables for the 0xc range when using the hash MMU, so
> we'd need to either fix that or use some bespoke data structure for
> keeping track of the mappings.
>=20
> It doesn't really appeal :)
Thanks for the detailed explanation. That does look gross :)
>=20
> We load modules in the 0xd range, and they call functions in the kernel,
> we handle that with trampolines. Can you do something similar?
Yes, we already handle this in our JIT. The scenario being considered=20
here is in how the BPF core informs the BPF JIT engine when it has to=20
call out to a different BPF JIT'ed function. The existing approach=20
encodes an offset from __bpf_call_base() in the imm32 field of the BPF=20
instruction itself. For powerpc64 though, I think we'll have to have the=20
address of the JIT'ed BPF functions passed in through other means.
>=20
> You obviously still need a 64-bit branch somewhere, but perhaps you only
> need one, or one per BPF program, rather than one per function call.
We may have more than a single branch depending on how the program is=20
laid out. We are looking to see if it is worth detecting that a call is=20
close-by to do a relative branch, rather than loading a full 64-bit=20
address and branching to that. For BPF-to-BPF calls, we should also be=20
able to skip all the r12/r2 GEP/TOC handling. We may need to emit an=20
additional branch to skip over some instructions (or emit nops) just so=20
we consistently emit a fixed number of instructions each pass.
>=20
>> The alternative is to pass the full 64-bit address of the BPF program in=
=20
>> an auxiliary field (as proposed in this patch set) but we need to fix it=
=20
>> up for 'bpftool' as well.
>=20
> OK. You'll have to explain to me how bpftool is related, I thought it
> was just for loading/examining BPF programs.
It is, and it can also dump loaded BPF programs. When such programs call=20
out to other BPF programs, bpftool currently figures out the address of=20
the target BPF function by essentially doing (__bpf_call_base + imm32)=20
and looking up kallsyms. Since we've established that imm32 is not=20
enough for us, this won't work on powerpc64.
However, if bpf_jit_kallsyms is not enabled, it looks like bpftool will=20
still fail to resolve the call, and only print out an address today. I'm=20
wondering if we can instead encode the bpf prog id in imm32. That way,=20
we should be able to indicate the BPF function being called into. =20
Daniel, is that something we can consider?
Thanks!
- Naveen
=
next prev parent reply other threads:[~2018-02-20 21:50 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-13 4:05 [RFC][PATCH bpf v2 1/2] bpf: allow 64-bit offsets for bpf function calls Sandipan Das
2018-02-13 4:06 ` [RFC][PATCH bpf v2 2/2] bpf: powerpc64: add JIT support for multi-function programs Sandipan Das
2018-02-15 16:25 ` [RFC][PATCH bpf v2 1/2] bpf: allow 64-bit offsets for bpf function calls Daniel Borkmann
2018-02-15 20:18 ` Daniel Borkmann
2018-02-16 15:50 ` Naveen N. Rao
2018-02-20 9:29 ` Michael Ellerman
2018-02-20 19:22 ` Naveen N. Rao [this message]
2018-02-27 12:13 ` [RFC][PATCH bpf] tools: bpftool: Fix tags for bpf-to-bpf calls Sandipan Das
2018-02-27 14:44 ` Daniel Borkmann
2018-03-01 8:51 ` Naveen N. Rao
2018-03-05 17:02 ` Alexei Starovoitov
[not found] ` <415b415e-f47f-082c-1bc9-87d3e9d3aed1__9575.16645216874$1520270545$gmane$org@fb.com>
[not found] ` <415b415e-f47f-082c-1bc9-87d3e9d3aed1__9575.16645216874$1520270545$gmane$org@ fb.com>
2018-05-03 15:20 ` Naveen N. Rao
2018-02-22 12:06 ` [RFC][PATCH bpf v2 1/2] bpf: allow 64-bit offsets for bpf function calls Michael Holzheu
2018-02-22 12:10 ` Michael Holzheu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1519153431.im5wioxel1.naveen@linux.ibm.com \
--to=naveen.n.rao@linux.vnet.ibm.com \
--cc=ast@fb.com \
--cc=daniel@iogearbox.net \
--cc=holzheu@linux.vnet.ibm.com \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=mpe@ellerman.id.au \
--cc=netdev@vger.kernel.org \
--cc=sandipan@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).