netdev.vger.kernel.org archive mirror
* [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Alexei Starovoitov @ 2015-05-22 22:42 UTC
  To: David S. Miller; +Cc: Daniel Borkmann, Eric Dumazet, netdev

x86 has variable-length instruction encoding, and the x86 JIT compiler
tries to pick the shortest encoding for each bpf instruction.
Because shrinking instructions moves the jump targets, the JIT makes
multiple passes over the program. A typical program needs 3 passes;
some very short programs converge in 2, and large programs
may need 4 or 5. But specially crafted bpf programs may hit the
pass limit, and if such a program converges on the last iteration,
the JIT compiler produces an image full of 'int 3' insns.
Fix this corner case by doing a final iteration over the bpf program.

Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
Daniel wrote the 'Edge hopping nuthouse' test case with 4k jump
instructions that managed to trigger this bug.
The test case is nuts, but the bug is real.
It's an old bug, but I think it's worth backporting all the way.
Note that this fix applies cleanly only back to commit:
f3c2af7ba17a ("net: filter: x86: split bpf_jit_compile()")
Older kernels should be similar: they have
'for (pass = 0; pass < 10; pass++) {' at around line 153,
and as far as I can see they all have a similar problem.
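A toy model may help show why several passes are needed. The sketch below is a hypothetical simplification, not the kernel's do_jit(): each instruction is either a plain op of fixed size or an unconditional jump, jump displacements are computed from the previous pass's offsets, and a jump is emitted as jmp rel8 (2 bytes) when the displacement fits in a signed byte, otherwise as jmp rel32 (5 bytes). The starting guess of 64 bytes per instruction is an assumption chosen only to be pessimistic.

```c
#include <assert.h>
#include <string.h>

#define MAX_INSNS 64

/* Hypothetical toy instruction: a plain op of `size` bytes, or an
 * unconditional jump to instruction index `target` (0..n). */
struct toy_insn {
	int is_jump;
	int target;
	int size;
};

/* One JIT pass. start[i] is insn i's offset from the previous pass and
 * start[n] is the previous total length. Jump displacements use these
 * stale offsets, so sizes settle only after several passes.
 * Updates start[] in place and returns the new total length. */
static int toy_pass(const struct toy_insn *p, int n, int *start)
{
	int next[MAX_INSNS + 1];
	int len = 0;

	for (int i = 0; i < n; i++) {
		int ilen = p[i].size;

		if (p[i].is_jump) {
			/* x86 displacement is relative to the end of the jump */
			int disp = start[p[i].target] - start[i + 1];

			/* jmp rel8 (EB cb) is 2 bytes, jmp rel32 (E9 cd) is 5 */
			ilen = (disp >= -128 && disp <= 127) ? 2 : 5;
		}
		next[i] = len;
		len += ilen;
	}
	next[n] = len;
	memcpy(start, next, (size_t)(n + 1) * sizeof(int));
	return len;
}

/* Repeat passes until the total length stops changing. Returns the
 * number of passes used, or -1 if max_passes was not enough. */
static int toy_converge(const struct toy_insn *p, int n, int max_passes,
			int *final_len)
{
	int start[MAX_INSNS + 1];
	int oldlen = 0;

	assert(n > 0 && n <= MAX_INSNS);
	/* pessimistic initial guess: 64 bytes per instruction */
	for (int i = 0; i <= n; i++)
		start[i] = 64 * i;

	for (int pass = 1; pass <= max_passes; pass++) {
		int len = toy_pass(p, n, start);

		if (len == oldlen) {
			*final_len = len;
			return pass;
		}
		oldlen = len;
	}
	return -1;
}
```

For a program of one forward jump over two 60-byte ops followed by a 1-byte op, the total length goes 126, 123, 123, so toy_converge() reports convergence on pass 3; with max_passes set to 2 it gives up and returns -1, which is the "hit the pass limit" situation in miniature.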

 arch/x86/net/bpf_jit_comp.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 99f76103c6b7..ddeff4844a10 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -966,7 +966,12 @@ void bpf_int_jit_compile(struct bpf_prog *prog)
 	}
 	ctx.cleanup_addr = proglen;
 
-	for (pass = 0; pass < 10; pass++) {
+	/* JITed image shrinks with every pass and the loop iterates
+	 * until the image stops shrinking. Very large bpf programs
+	 * may converge on the last pass. In such a case do one more
+	 * pass to emit the final image.
+	 */
+	for (pass = 0; pass < 10 || image; pass++) {
 		proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
 		if (proglen <= 0) {
 			image = NULL;
-- 
1.7.9.5
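The effect of the loop-condition change can be modelled in isolation. In this hypothetical sketch, fake_do_jit() stands in for do_jit() and returns a length that shrinks every pass until a chosen convergence pass, and a boolean stands in for the image buffer. It assumes, as the commit message implies, that the image is allocated (pre-filled with 'int 3') once two consecutive passes return the same length, and that the following pass writes the real code into it. With the old condition `pass < 10`, convergence on the last pass (pass 9) leaves the loop with an image that was allocated but never written; `pass < 10 || image` grants the one extra pass.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PASSES 10

/* Hypothetical stand-in for do_jit(): the returned length shrinks by 10
 * bytes per pass and first repeats at pass `converge_pass` (>= 1). */
static int fake_do_jit(int converge_pass, int pass)
{
	int shrink_steps = pass < converge_pass ? pass : converge_pass - 1;

	return 1000 - 10 * shrink_steps;
}

/* Returns true if real code was written into the image.
 * `fixed` selects the patched condition `pass < 10 || image`. */
static bool toy_jit_compile(int converge_pass, bool fixed)
{
	int oldproglen = 0;
	bool image = false;  /* stands in for the int3-filled buffer */

	for (int pass = 0; pass < MAX_PASSES || (fixed && image); pass++) {
		int proglen = fake_do_jit(converge_pass, pass);

		if (image) {
			/* this pass emitted into the image; size must match */
			assert(proglen == oldproglen);
			return true;
		}
		if (proglen == oldproglen)
			image = true;  /* allocate; next pass fills it */
		oldproglen = proglen;
	}
	/* out of passes: any allocated image was never written */
	return false;
}
```

In this model the extended condition adds at most one pass: the first iteration that sees an image always returns, so the loop cannot run away. A program that never converges within 10 passes still fails in both versions.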


* Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Daniel Borkmann @ 2015-05-22 22:46 UTC
  To: Alexei Starovoitov, David S. Miller; +Cc: Eric Dumazet, netdev

On 05/23/2015 12:42 AM, Alexei Starovoitov wrote:
> x86 has variable length encoding. x86 JIT compiler is trying
> to pick the shortest encoding for given bpf instruction.
> While doing so the jump targets are changing, so JIT is doing
> multiple passes over the program. Typical program needs 3 passes.
> Some very short programs converge with 2 passes. Large programs
> may need 4 or 5. But specially crafted bpf programs may hit the
> pass limit and if the program converges on the last iteration
> the JIT compiler will be producing an image full of 'int 3' insns.
> Fix this corner case by doing final iteration over bpf program.
>
> Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
> Reported-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

LGTM, thanks!

Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: David Miller @ 2015-05-25  4:19 UTC
  To: ast; +Cc: daniel, edumazet, netdev

From: Alexei Starovoitov <ast@plumgrid.com>
Date: Fri, 22 May 2015 15:42:55 -0700

> x86 has variable length encoding. x86 JIT compiler is trying
> to pick the shortest encoding for given bpf instruction.
> While doing so the jump targets are changing, so JIT is doing
> multiple passes over the program. Typical program needs 3 passes.
> Some very short programs converge with 2 passes. Large programs
> may need 4 or 5. But specially crafted bpf programs may hit the
> pass limit and if the program converges on the last iteration
> the JIT compiler will be producing an image full of 'int 3' insns.
> Fix this corner case by doing final iteration over bpf program.
> 
> Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
> Reported-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Applied and queued up for -stable, thanks.


* RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: David Laight @ 2015-05-26 13:40 UTC
  To: 'Alexei Starovoitov', David S. Miller
  Cc: Daniel Borkmann, Eric Dumazet, netdev@vger.kernel.org

From: Alexei Starovoitov
> Sent: 22 May 2015 23:43
> x86 has variable length encoding. x86 JIT compiler is trying
> to pick the shortest encoding for given bpf instruction.
> While doing so the jump targets are changing, so JIT is doing
> multiple passes over the program. Typical program needs 3 passes.
> Some very short programs converge with 2 passes. Large programs
> may need 4 or 5. But specially crafted bpf programs may hit the
> pass limit and if the program converges on the last iteration
> the JIT compiler will be producing an image full of 'int 3' insns.
> Fix this corner case by doing final iteration over bpf program.

If the JIT compiler is only changing the encoding of the constants
in the x86 instructions (rather than changing the instructions themselves),
then there is likely to be an unmeasurable change in the execution time.
For instance, I don't remember there being a difference in execution time
between long and short branches; the only difference is the amount of
cache they use.

	David
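To put rough numbers on that cache-footprint difference: on x86 a short jmp (EB cb) or short jcc (7x cb) takes 2 bytes, while the rel32 forms take 5 bytes (E9 cd) and 6 bytes (0F 8x cd). This sketch assumes every branch can be shortened and 64-byte cache lines; the helper names are illustrative only.

```c
#include <assert.h>

/* x86 branch encoding sizes in bytes */
enum {
	JMP_REL8  = 2,  /* EB cb */
	JMP_REL32 = 5,  /* E9 cd */
	JCC_REL8  = 2,  /* 7x cb */
	JCC_REL32 = 6,  /* 0F 8x cd */
};

/* Bytes saved when n_jmp unconditional and n_jcc conditional branches
 * are all shrunk from their rel32 form to their rel8 form. */
static long branch_bytes_saved(long n_jmp, long n_jcc)
{
	assert(n_jmp >= 0 && n_jcc >= 0);
	return n_jmp * (JMP_REL32 - JMP_REL8) +
	       n_jcc * (JCC_REL32 - JCC_REL8);
}

/* Round a byte count up to whole cache lines of line_size bytes. */
static long cache_lines(long bytes, long line_size)
{
	assert(line_size > 0);
	return (bytes + line_size - 1) / line_size;
}
```

If all 4096 jumps of a program like the 4k-jump test case were conditional and shortenable (an assumption), shortening them saves 16384 bytes, i.e. 256 cache lines; 4096 unconditional jumps would save 12288 bytes, or 192 lines.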


* Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Eric Dumazet @ 2015-05-26 14:35 UTC
  To: David Laight
  Cc: 'Alexei Starovoitov', David S. Miller, Daniel Borkmann,
	Eric Dumazet, netdev@vger.kernel.org

On Tue, 2015-05-26 at 13:40 +0000, David Laight wrote:

> If the JIT compiler is only changing the encoding of the constants
> in the x86 instructions (rather than changing the instructions themselves)
> then there is likely to be an unmeasurable change in the execution time.
> For instance I don't remember there being a difference in execution time
> between long and short branches - the only difference is the amount of
> cache they use.

The icache is precisely what matters here. In the end, it makes a difference.

You could check out this interesting study Ingo did recently:

https://lkml.org/lkml/2015/5/19/1009


* RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: David Laight @ 2015-05-26 15:13 UTC
  To: 'Eric Dumazet'
  Cc: 'Alexei Starovoitov', David S. Miller, Daniel Borkmann,
	Eric Dumazet, netdev@vger.kernel.org

From: Eric Dumazet
> Sent: 26 May 2015 15:35
> On Tue, 2015-05-26 at 13:40 +0000, David Laight wrote:
> 
> > If the JIT compiler is only changing the encoding of the constants
> > in the x86 instructions (rather than changing the instructions themselves)
> > then there is likely to be an unmeasurable change in the execution time.
> > For instance I don't remember there being a difference in execution time
> > between long and short branches - the only difference is the amount of
> > cache they use.
> 
> icache is precisely the matter here. In the end, it makes a difference.
> 
> You could check this interesting study Ingo did recently :
> 
> https://lkml.org/lkml/2015/5/19/1009

Yes, interesting: a benchmark that manages to run a lot of code 'cold cache'.
Its results are possibly dominated by having a full cache line of code to
execute at the beginning of each function.

My guess is that aligning function bodies helps, whereas aligning branch
targets (as some old versions of gcc used to do aggressively) is more likely
to have a negative effect, because it increases the icache footprint too much.
Unrolling loops further than needed to avoid data stalls needs very
careful study.

For JIT, once the obviously short branches have been reduced, further code
size reduction might not be worthwhile.

	David
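The cost of aligning branch targets can be estimated with a small layout walk. This hypothetical helper pads the start of each basic block (each assumed to be a branch target) up to an `align`-byte boundary; because padding shifts every later block, the layout has to be walked in order rather than computed per block.

```c
#include <assert.h>
#include <stddef.h>

/* Padding needed to bring `off` up to a multiple of `align`
 * (align must be a power of two). */
static size_t pad_to(size_t off, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (align - (off & (align - 1))) & (align - 1);
}

/* Total padding inserted when every block start is aligned to `align`
 * bytes; block_sizes[] are the code sizes of consecutive blocks. */
static size_t total_align_padding(const size_t *block_sizes, size_t n,
				  size_t align)
{
	size_t off = 0, pad = 0;

	for (size_t i = 0; i < n; i++) {
		size_t p = pad_to(off, align);

		pad += p;
		off += p + block_sizes[i];
	}
	return pad;
}
```

For example, three 5-byte blocks aligned to 16 bytes need 22 bytes of padding for 15 bytes of code, which illustrates how branch-target alignment can inflate the icache footprint of code with many small blocks.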



* Re: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: Eric Dumazet @ 2015-05-26 15:29 UTC
  To: David Laight
  Cc: 'Alexei Starovoitov', David S. Miller, Daniel Borkmann,
	Eric Dumazet, netdev@vger.kernel.org

On Tue, 2015-05-26 at 15:13 +0000, David Laight wrote:

> Yes, interesting, a benchmark that manages to run a lot of code 'cold cache'.

We have binaries here at Google with 400 or 500 MBytes of text.

Not benchmarks: super real workloads, you know.


* RE: [PATCH net] x86: bpf_jit: fix compilation of large bpf programs
From: David Laight @ 2015-05-26 15:47 UTC
  To: 'Eric Dumazet'
  Cc: 'Alexei Starovoitov', David S. Miller, Daniel Borkmann,
	Eric Dumazet, netdev@vger.kernel.org

From: Eric Dumazet 
> Sent: 26 May 2015 16:30
> 
> > Yes, interesting, a benchmark that manages to run a lot of code 'cold cache'.
> 
> We have binaries here at Google with 400 or 500 MBytes of text.
> 
> Not benchmark, super real workloads you know.

Indeed, and a lot of the code is likely to be running 'cold cache'.

I was alluding to the problem where people benchmark a small function
by running it 1000s of times in a tight loop with exactly the same data.
Not only is it 'hot cache', but any dynamic branch prediction is also
'trained' to the specific data.

	David
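A minimal sketch of the contrast David describes: the same branchy function is timed over a constant input (hot cache, branch predictor trained) and over pseudo-random inputs. classify(), bench() and fill_inputs() are hypothetical names; clock_gettime(CLOCK_MONOTONIC) is the POSIX clock. The checksum keeps the compiler from discarding the work, and absolute timings are machine-dependent, so none are shown.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* A small branchy function under test (hypothetical example). */
static int classify(int x)
{
	if (x < 0)
		return -1;
	return (x & 1) ? 1 : 2;
}

/* Run classify() over inputs[0..n) `reps` times. Returns elapsed
 * nanoseconds and stores a checksum of the results in *sum. */
static long long bench(const int *inputs, size_t n, int reps, long long *sum)
{
	struct timespec t0, t1;
	long long s = 0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int r = 0; r < reps; r++)
		for (size_t i = 0; i < n; i++)
			s += classify(inputs[i]);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*sum = s;
	return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}

/* Fill inputs with one repeated value (predictor-friendly) or with
 * pseudo-random values in [-100, 100] (predictor-hostile). */
static void fill_inputs(int *inputs, size_t n, int randomize, unsigned seed)
{
	srand(seed);
	for (size_t i = 0; i < n; i++)
		inputs[i] = randomize ? rand() % 201 - 100 : 42;
}
```

In practice the input array would need to be much larger than the caches, and the code under test ideally evicted between runs, before the 'cold cache' case is really being measured.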


 


