Re: [PATCH v3] x86/alternatives: remove false sharing in poke_int3_handler()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ingo Molnar <mingo@kernel.org>
To: Eric Dumazet <edumazet@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	x86@kernel.org, bpf@vger.kernel.org,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Greg Thelen <gthelen@google.com>,
	Stephane Eranian <eranian@google.com>
Subject: Re: [PATCH v3] x86/alternatives: remove false sharing in poke_int3_handler()
Date: Tue, 25 Mar 2025 08:14:44 +0100	[thread overview]
Message-ID: <Z-JX5ImltdTFoFgr@gmail.com> (raw)
In-Reply-To: <20250325043316.874518-1-edumazet@google.com>


* Eric Dumazet <edumazet@google.com> wrote:

> eBPF programs can be run 50,000,000 times per second on busy servers.
> 
> Whenever /proc/sys/kernel/bpf_stats_enabled is turned off,
> hundreds of calls sites are patched from text_poke_bp_batch()
> and we see a huge loss of performance due to false sharing
> on bp_desc.refs lasting up to three seconds.
> 
>    51.30%  server_bin       [kernel.kallsyms]           [k] poke_int3_handler
>             |
>             |--46.45%--poke_int3_handler
>             |          exc_int3
>             |          asm_exc_int3
>             |          |
>             |          |--24.26%--cls_bpf_classify
>             |          |          tcf_classify
>             |          |          __dev_queue_xmit
>             |          |          ip6_finish_output2
>             |          |          ip6_output
>             |          |          ip6_xmit
>             |          |          inet6_csk_xmit
>             |          |          __tcp_transmit_skb
> 
> Fix this by replacing bp_desc.refs with a per-cpu bp_refs.
> 
> Before the patch, on a host with 240 cores (480 threads):
> 
> sysctl -wq kernel.bpf_stats_enabled=0
> 
> text_poke_bp_batch(nr_entries=164) : Took 2655300 usec
> 
> bpftool prog | grep run_time_ns
> ...
> 105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns
> 3009063719 run_cnt 82757845 : average cost is 36 nsec per call
> 
> After this patch:
> 
> sysctl -wq kernel.bpf_stats_enabled=0
> 
> text_poke_bp_batch(nr_entries=164) : Took 702 usec
> 
> $ bpftool prog | grep run_time_ns
> ...
> 105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns
> 1928223019 run_cnt 67682728 : average cost is 28 nsec per call
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  arch/x86/kernel/alternative.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)

Thanks for the updates. I've further improved the changelog (see 
attached below), and have tentatively applied it to 
tip:x86/alternatives.

Thanks,

	Ingo

==============================>
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 25 Mar 2025 04:33:16 +0000
Subject: [PATCH] x86/alternatives: Improve code-patching scalability by removing false sharing in poke_int3_handler()

eBPF programs can be run 50,000,000 times per second on busy servers.

Whenever /proc/sys/kernel/bpf_stats_enabled is turned off,
hundreds of calls sites are patched from text_poke_bp_batch()
and we see a huge loss of performance due to false sharing
on bp_desc.refs lasting up to three seconds.

   51.30%  server_bin       [kernel.kallsyms]           [k] poke_int3_handler
            |
            |--46.45%--poke_int3_handler
            |          exc_int3
            |          asm_exc_int3
            |          |
            |          |--24.26%--cls_bpf_classify
            |          |          tcf_classify
            |          |          __dev_queue_xmit
            |          |          ip6_finish_output2
            |          |          ip6_output
            |          |          ip6_xmit
            |          |          inet6_csk_xmit
            |          |          __tcp_transmit_skb

Fix this by replacing bp_desc.refs with a per-cpu bp_refs.

Before the patch, on a host with 240 cores (480 threads):

  $ sysctl -wq kernel.bpf_stats_enabled=0

  text_poke_bp_batch(nr_entries=164) : Took 2655300 usec

  $ bpftool prog | grep run_time_ns
  ...
  105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns
  3009063719 run_cnt 82757845 : average cost is 36 nsec per call

After this patch:

  $ sysctl -wq kernel.bpf_stats_enabled=0

  text_poke_bp_batch(nr_entries=164) : Took 702 usec

  $ bpftool prog | grep run_time_ns
  ...
  105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns
  1928223019 run_cnt 67682728 : average cost is 28 nsec per call

Ie. text-patching performance improved 3700x: from 2.65 seconds
to 0.0007 seconds.

Since the atomic_cond_read_acquire(refs, !VAL) spin-loop was not triggered
even once in my tests, add an unlikely() annotation, because this appears
to be the common case.

[ mingo: Improved the changelog some more. ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250325043316.874518-1-edumazet@google.com

next prev parent reply	other threads:[~2025-03-25  7:14 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-25  4:33 [PATCH v3] x86/alternatives: remove false sharing in poke_int3_handler() Eric Dumazet
2025-03-25  7:14 ` Ingo Molnar [this message]
2025-03-25  7:45   ` Eric Dumazet
2025-03-25  7:25 ` [tip: x86/alternatives] x86/alternatives: Improve code-patching scalability by removing " tip-bot2 for Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z-JX5ImltdTFoFgr@gmail.com \
    --to=mingo@kernel.org \
    --cc=ast@kernel.org \
    --cc=bp@alien8.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=edumazet@google.com \
    --cc=eranian@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=gthelen@google.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.