public inbox for linux-kernel@vger.kernel.org
* Re: Percpu variables, benchmarking, and performance weirdness
       [not found] <CAJ+HfNgNAzvdBw7gBJTCDQsne-HnWm90H50zNvXBSp4izbwFTA@mail.gmail.com>
@ 2019-12-20  9:34 ` Jesper Dangaard Brouer
  2019-12-20 15:12   ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-20  9:34 UTC (permalink / raw)
  To: Björn Töpel
  Cc: bpf, brouer, LKML, Tejun Heo, Christoph Lameter, Dennis Zhou

On Fri, 20 Dec 2019 09:25:43 +0100
Björn Töpel <bjorn.topel@gmail.com> wrote:

> I've been doing some benchmarking with AF_XDP, and more specifically
> the bpf_xdp_redirect_map() helper and xdp_do_redirect(). One thing
> that puzzles me is that the percpu-variable accesses stand out.
> 
> I did a horrible hack that just accesses a regular global variable,
> instead of the percpu struct bpf_redirect_info, and got a performance
> boost from 22.7 Mpps to 23.8 Mpps with the rxdrop scenario from
> xdpsock.

Yes, this is a 2 ns overhead, which is annoying in an XDP context.
 (1/22.7-1/23.8)*1000 = 2 ns

> Has anyone else seen this?

Yes, I see it all the time...

> So, my question to the uarch/percpu folks out there: Why are percpu
> accesses (%gs segment register) more expensive than regular global
> variables in this scenario?

I'm also VERY interested in knowing the answer to the above question!
(Adding LKML to reach more people)


> One way around that is changing BPF_PROG_RUN, and BPF_CALL_x to pass a
> context (struct bpf_redirect_info) explicitly, and access that instead
> of doing percpu access. That would be a pretty churny patch, and
> before doing that it would be nice to understand why percpu stands out
> performance-wise.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Percpu variables, benchmarking, and performance weirdness
  2019-12-20  9:34 ` Percpu variables, benchmarking, and performance weirdness Jesper Dangaard Brouer
@ 2019-12-20 15:12   ` Tejun Heo
  2019-12-20 15:36     ` Christopher Lameter
  2019-12-20 16:22     ` Eric Dumazet
  0 siblings, 2 replies; 6+ messages in thread
From: Tejun Heo @ 2019-12-20 15:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, bpf, LKML, Christoph Lameter, Dennis Zhou

On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > So, my question to the uarch/percpu folks out there: Why are percpu
> > accesses (%gs segment register) more expensive than regular global
> > variables in this scenario?
> 
> I'm also VERY interested in knowing the answer to the above question!
> (Adding LKML to reach more people)

No idea.  One difference is that percpu accesses are through the vmap
area, which is mapped using 4k pages, while a global variable would be
accessed through the default linear mapping.  Maybe you're getting hit
by TLB pressure?

Thanks.

-- 
tejun


* Re: Percpu variables, benchmarking, and performance weirdness
  2019-12-20 15:12   ` Tejun Heo
@ 2019-12-20 15:36     ` Christopher Lameter
  2019-12-20 17:10       ` Dennis Zhou
  2019-12-20 16:22     ` Eric Dumazet
  1 sibling, 1 reply; 6+ messages in thread
From: Christopher Lameter @ 2019-12-20 15:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jesper Dangaard Brouer, Björn Töpel, bpf, LKML,
	Dennis Zhou

On Fri, 20 Dec 2019, Tejun Heo wrote:

> On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > > So, my question to the uarch/percpu folks out there: Why are percpu
> > > accesses (%gs segment register) more expensive than regular global
> > > variables in this scenario?
> >
> > I'm also VERY interested in knowing the answer to the above question!
> > (Adding LKML to reach more people)
>
> No idea.  One difference is that percpu accesses are through the vmap
> area, which is mapped using 4k pages, while a global variable would be
> accessed through the default linear mapping.  Maybe you're getting hit
> by TLB pressure?

And there are some accesses from remote processors to the per-cpu
areas of other cpus. If those land in the same cacheline, they will
cause additional latencies.



* Re: Percpu variables, benchmarking, and performance weirdness
  2019-12-20 15:12   ` Tejun Heo
  2019-12-20 15:36     ` Christopher Lameter
@ 2019-12-20 16:22     ` Eric Dumazet
  2019-12-20 16:34       ` Tejun Heo
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2019-12-20 16:22 UTC (permalink / raw)
  To: Tejun Heo, Jesper Dangaard Brouer
  Cc: Björn Töpel, bpf, LKML, Christoph Lameter, Dennis Zhou



On 12/20/19 7:12 AM, Tejun Heo wrote:
> On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
>>> So, my question to the uarch/percpu folks out there: Why are percpu
>>> accesses (%gs segment register) more expensive than regular global
>>> variables in this scenario?
>>
>> I'm also VERY interested in knowing the answer to the above question!
>> (Adding LKML to reach more people)
> 
> No idea.  One difference is that percpu accesses are through the vmap
> area, which is mapped using 4k pages, while a global variable would be
> accessed through the default linear mapping.  Maybe you're getting hit
> by TLB pressure?

I have definitely seen expensive per-cpu updates in the stack.
(SNMP counters, or per-cpu stats for packets/bytes counters)

It might be nice to have an option to use 2M pages.

(I recall sending some patches in the past about using high-order pages for vmalloc,
but this went nowhere)


* Re: Percpu variables, benchmarking, and performance weirdness
  2019-12-20 16:22     ` Eric Dumazet
@ 2019-12-20 16:34       ` Tejun Heo
  0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2019-12-20 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Björn Töpel, bpf, LKML,
	Christoph Lameter, Dennis Zhou

On Fri, Dec 20, 2019 at 08:22:02AM -0800, Eric Dumazet wrote:
> I have definitely seen expensive per-cpu updates in the stack.
> (SNMP counters, or per-cpu stats for packets/bytes counters)
> 
> It might be nice to have an option to use 2M pages.
> 
> (I recall sending some patches in the past about using high-order pages for vmalloc,
> but this went nowhere)

Yeah, the percpu allocator implementation is half-way prepared for
that; there just hasn't been a real need for it yet.  If the
difference actually comes from TLB pressure, this might be the fix, I
guess?

Thanks.

-- 
tejun


* Re: Percpu variables, benchmarking, and performance weirdness
  2019-12-20 15:36     ` Christopher Lameter
@ 2019-12-20 17:10       ` Dennis Zhou
  0 siblings, 0 replies; 6+ messages in thread
From: Dennis Zhou @ 2019-12-20 17:10 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Tejun Heo, Jesper Dangaard Brouer, Björn Töpel, bpf,
	LKML

On Fri, Dec 20, 2019 at 03:36:51PM +0000, Christopher Lameter wrote:
> On Fri, 20 Dec 2019, Tejun Heo wrote:
> 
> > On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > > > So, my question to the uarch/percpu folks out there: Why are percpu
> > > > accesses (%gs segment register) more expensive than regular global
> > > > variables in this scenario?
> > >
> > > I'm also VERY interested in knowing the answer to the above question!
> > > (Adding LKML to reach more people)
> >
> > No idea.  One difference is that percpu accesses are through the vmap
> > area, which is mapped using 4k pages, while a global variable would be
> > accessed through the default linear mapping.  Maybe you're getting hit
> > by TLB pressure?

bpf_redirect_info is static, so it should be accessed via the linear
mapping as well if we're embedding the first chunk.

> 
> And there are some accesses from remote processors to the per-cpu
> areas of other cpus. If those land in the same cacheline, they will
> cause additional latencies.
> 

I guess we could pad out certain structs like bpf_redirect_info, but
that isn't really ideal.


end of thread, other threads:[~2019-12-20 17:10 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
     [not found] <CAJ+HfNgNAzvdBw7gBJTCDQsne-HnWm90H50zNvXBSp4izbwFTA@mail.gmail.com>
2019-12-20  9:34 ` Percpu variables, benchmarking, and performance weirdness Jesper Dangaard Brouer
2019-12-20 15:12   ` Tejun Heo
2019-12-20 15:36     ` Christopher Lameter
2019-12-20 17:10       ` Dennis Zhou
2019-12-20 16:22     ` Eric Dumazet
2019-12-20 16:34       ` Tejun Heo
