linux-perf-users.vger.kernel.org archive mirror
* Perf support in CPython
@ 2023-11-20 23:50 Pablo Galindo Salgado
  2023-11-21 18:15 ` Namhyung Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Pablo Galindo Salgado @ 2023-11-20 23:50 UTC (permalink / raw)
  To: linux-perf-users

Hi,

I am Pablo Galindo Salgado from the CPython core development team. In
the latest release of CPython (Python 3.12) I have added support for
including Python function calls in the output of perf by leveraging
the jit-interface support (writing to /tmp/perf-%d.map). The support
works by compiling assembly trampolines at runtime that just call the
Python bytecode evaluation loop and assigning function names to the
trampolines by writing to /tmp/perf-%d.map. This has worked really
well when CPython is compiled with frame pointers, but unfortunately
CPython also suffers a 5% to 10% slowdown when frame pointers are
used. Moreover, perf currently cannot unwind through the jitted
trampolines without frame pointers.
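
For reference, the perf map interface only needs one text line per
jitted symbol, of the form "START SIZE symbolname" with START and SIZE
in hex. A minimal sketch of writing such an entry for a trampoline
(the helper and the exact "py::" symbol naming here are illustrative,
not the literal CPython code):

  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Append one "START SIZE name" line describing a jitted trampoline
   * to the per-process map file that perf reads: /tmp/perf-<pid>.map. */
  static void write_perf_map_entry(const void *code, size_t size,
                                   const char *qualified_name)
  {
      char path[64];
      FILE *f;

      snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int)getpid());
      f = fopen(path, "a");
      if (f == NULL)
          return;
      fprintf(f, "%lx %lx py::%s\n",
              (unsigned long)(uintptr_t)code, (unsigned long)size,
              qualified_name);
      fclose(f);
  }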

I am looking to extend support to the jitdump specification to allow
perf to unwind through these trampolines when using dwarf unwinding
mode. I have a draft patch
(https://github.com/python/cpython/pull/112254) that adds support for
the specification by writing JIT_CODE_LOAD and JIT_CODE_UNWINDING_INFO
records (currently with empty EH frame entries to force perf to defer
to FP-based unwinding). This works well on my current distribution
(Arch Linux with perf version 6.5) but falls apart when tested on
other distributions.
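
For context, the records the patch emits follow the layout described
in tools/perf/Documentation/jitdump-specification.txt (and perf's
util/jitdump.h). Roughly, and with the caveat that the struct and
field names below are my shorthand rather than copied verbatim from
the headers:

  #include <stdint.h>

  /* Common prefix of every jitdump record. */
  struct jr_prefix {
      uint32_t id;          /* JIT_CODE_LOAD = 0, JIT_CODE_UNWINDING_INFO = 4 */
      uint32_t total_size;  /* record size, including this prefix */
      uint64_t timestamp;   /* same clockid as used by perf record -k */
  };

  /* JIT_CODE_LOAD: followed by the function name (NUL-terminated)
   * and then the native code bytes themselves. */
  struct jr_code_load {
      struct jr_prefix p;
      uint32_t pid;
      uint32_t tid;
      uint64_t vma;
      uint64_t code_addr;
      uint64_t code_size;
      uint64_t code_index;
  };

  /* JIT_CODE_UNWINDING_INFO: followed by the EH frame header and EH
   * frame data; the draft currently emits these as empty. */
  struct jr_code_unwinding_info {
      struct jr_prefix p;
      uint64_t unwinding_size;
      uint64_t eh_frame_hdr_size;
      uint64_t mapped_size;
  };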

* One problem I have struggled to debug is that even when compiled
with frame pointers and without emitting JIT_CODE_UNWINDING_INFO
(using fp-based unwinding), in some cases (such as the latest Ubuntu
versions) the shared objects and addresses generated by perf inject
-j look sane and the addresses match, but perf report and perf script
do not include the symbol names in their output. I do not understand
why this is happening, and my debugging so far has not given me any
thread to pull, so I wanted to reach out to this mailing list for
assistance.

* The other problem I am finding is including eh_frames for our
trampolines, as it looks like deferring to FP unwinding is unreliable
even if our trampolines are trivial. This is a different problem, but
my current attempts to add EH frames have failed and I don't know how
to debug them properly, as it looks like perf doesn't expect some of
the values we are writing. I would love it if someone with perf and
DWARF expertise could help steer us in the CPython team in the right
direction.

Thanks a lot in advance,
Pablo Galindo Salgado


* Re: Perf support in CPython
  2023-11-20 23:50 Perf support in CPython Pablo Galindo Salgado
@ 2023-11-21 18:15 ` Namhyung Kim
  2023-11-21 18:30   ` Pablo Galindo Salgado
  2023-11-22 21:04 ` Ian Rogers
  2023-11-30 11:00 ` James Clark
  2 siblings, 1 reply; 11+ messages in thread
From: Namhyung Kim @ 2023-11-21 18:15 UTC (permalink / raw)
  To: Pablo Galindo Salgado; +Cc: linux-perf-users

Hello,

On Mon, Nov 20, 2023 at 3:51 PM Pablo Galindo Salgado
<pablogsal@gmail.com> wrote:
>
> Hi,
>
> I am Pablo Galindo Salgado from the CPython core development team. In
> the latest release of CPython (Python 3.12) I have added support for
> including Python function calls in the output of perf by leveraging
> the jit-interface support (writing to /tmp/perf-%d.map).

Cool!


> The support
> works by compiling assembly trampolines at runtime that just call the
> Python bytecode evaluation loop and assigning function names to the
> trampolines by writing to /tmp/perf-%d.map. This has worked really
> well when CPython is compiled with frame pointers, but unfortunately

I'm curious if it works well when the compiled trampolines are reused.
IOW what if two different trampolines have the same address (one used
and freed and then another came to the same address later)?


> CPython also suffers a 5% to 10% slowdown when frame pointers are
> used. Moreover, perf currently cannot unwind through the jitted
> trampolines without frame pointers.

I see.

>
> I am looking to extend support to the jitdump specification to allow
> perf to unwind through these trampolines when using dwarf unwinding
> mode. I have a draft patch
> (https://github.com/python/cpython/pull/112254) that adds support for
> the specification by writing JIT_CODE_LOAD and JIT_CODE_UNWINDING_INFO
> records (currently with empty EH frame entries to force perf to defer
> to FP-based unwinding). This works well on my current distribution
> (Arch Linux with perf version 6.5) but falls apart when tested on
> other distributions.

Looks interesting.  Let me take a look (after Thanksgiving).

>
> * One problem I have struggled to debug is that even when compiled
> with frame pointers and without emitting JIT_CODE_UNWINDING_INFO
> (using fp-based unwinding), in some cases (such as the latest Ubuntu
> versions) the shared objects and addresses generated by perf inject
> -j look sane and the addresses match, but perf report and perf script
> do not include the symbol names in their output. I do not understand
> why this is happening, and my debugging so far has not given me any
> thread to pull, so I wanted to reach out to this mailing list for
> assistance.

Ok, I think it needs more detail..  Have you filed a bug somewhere?

>
> * The other problem I am finding is including eh_frames for our
> trampolines, as it looks like deferring to FP unwinding is unreliable
> even if our trampolines are trivial. This is a different problem, but
> my current attempts to add EH frames have failed and I don't know how
> to debug them properly, as it looks like perf doesn't expect some of
> the values we are writing. I would love it if someone with perf and
> DWARF expertise could help steer us in the CPython team in the right
> direction.
>
> Thanks a lot in advance,
> Pablo Galindo Salgado
>

I'd be happy to help.  Thanks for reaching out.
Namhyung


* Re: Perf support in CPython
  2023-11-21 18:15 ` Namhyung Kim
@ 2023-11-21 18:30   ` Pablo Galindo Salgado
  2023-11-21 18:49     ` Namhyung Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Pablo Galindo Salgado @ 2023-11-21 18:30 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-perf-users

> I'm curious if it works well when the compiled trampolines are reused.
> IOW what if two different trampolines have the same address (one used
> and freed and then another came to the same address later)?

The compiled trampolines are never reused. As all the trampolines have
the same native code, we have a pool of copies, we assign one per
Python code object, and we tell perf about the assignment by writing
to /tmp/perf-%d.map. Trampolines are only freed at the end of the
program.


> Looks interesting.  Let me take a look (after Thanksgiving).

Fantastic! You can read about the integration here:

https://docs.python.org/3/howto/perf_profiling.html

That has some examples that can show you how the integration is
supposed to work. You can check the link in my original message to
explore where the code lives.

> Ok, I think it needs more detail..  Have you filed a bug somewhere?

No, I still don't know if it is a bug or I am messing something up,
but it is certainly mysterious so I need
some help figuring out if it is my fault or perf's.

> I'd be happy to help.  Thanks for reaching out.

Thanks a lot! Looking forward to working together on this!

Pablo Galindo Salgado


* Re: Perf support in CPython
  2023-11-21 18:30   ` Pablo Galindo Salgado
@ 2023-11-21 18:49     ` Namhyung Kim
  0 siblings, 0 replies; 11+ messages in thread
From: Namhyung Kim @ 2023-11-21 18:49 UTC (permalink / raw)
  To: Pablo Galindo Salgado; +Cc: linux-perf-users

On Tue, Nov 21, 2023 at 10:30 AM Pablo Galindo Salgado
<pablogsal@gmail.com> wrote:
>
> > I'm curious if it works well when the compiled trampolines are reused.
> > IOW what if two different trampolines have the same address (one used
> > and freed and then another came to the same address later)?
>
> The compiled trampolines are never reused. As all the trampolines have
> the same native code, we have a pool of copies, we assign one per
> Python code object, and we tell perf about the assignment by writing
> to /tmp/perf-%d.map. Trampolines are only freed at the end of the
> program.

Great.  Yeah, that'd work then.

>
>
> > Looks interesting.  Let me take a look (after Thanksgiving).
>
> Fantastic! You can read about the integration here:
>
> https://docs.python.org/3/howto/perf_profiling.html
>
> That has some examples that can show you how the integration is
> supposed to work. You can check the link in my original message to
> explore where the code lives.

Ok, thanks for the pointer.

>
> > Ok, I think it needs more detail..  Have you filed a bug somewhere?
>
> No, I still don't know if it is a bug or I am messing something up,
> but it is certainly mysterious so I need
> some help figuring out if it is my fault or perf's.

It's hard to tell without seeing the details.
Maybe we can talk about this later.

>
> > I'd be happy to help.  Thanks for reaching out.
>
> Thanks a lot! Looking forward to working together on this!

Thanks,
Namhyung


* Re: Perf support in CPython
  2023-11-20 23:50 Perf support in CPython Pablo Galindo Salgado
  2023-11-21 18:15 ` Namhyung Kim
@ 2023-11-22 21:04 ` Ian Rogers
  2023-11-30  6:39   ` Namhyung Kim
  2023-11-30 11:00 ` James Clark
  2 siblings, 1 reply; 11+ messages in thread
From: Ian Rogers @ 2023-11-22 21:04 UTC (permalink / raw)
  To: Pablo Galindo Salgado; +Cc: linux-perf-users, Andrii Nakryiko

On Mon, Nov 20, 2023 at 3:50 PM Pablo Galindo Salgado
<pablogsal@gmail.com> wrote:
>
> Hi,
>
> I am Pablo Galindo Salgado from the CPython core development team. In
> the latest release of CPython (Python 3.12) I have added support for
> including Python function calls in the output of perf by leveraging
> the jit-interface support (writing to /tmp/perf-%d.map). The support
> works by compiling assembly trampolines at runtime that just call the
> Python bytecode evaluation loop and assigning function names to the
> trampolines by writing to /tmp/perf-%d.map. This has worked really
> well when CPython is compiled with frame pointers, but unfortunately
> CPython also suffers a 5% to 10% slowdown when frame pointers are
> used. Moreover, perf currently cannot unwind through the jitted
> trampolines without frame pointers.

There was an analysis of the CPython frame pointer regression here:
https://pagure.io/fesco/issue/2817#comment-826636
but with GCC some of it was stemming from sloppy code generation:
always using [EBP + ..] addressing rather than a shorter [ESP + ..]
form where one is available.

I was ranting at LPC last week about people not wanting frame pointers
so I'll try to write a blog, but some pertinent points:

- we're only talking x86 here and we should try to avoid overfitting
to one particular architecture.

- x86 with APX extensions (ETA 2025) is getting 16 new registers.

- frame pointers don't need to cost a register: you can put the frame
pointer in thread-local storage and then "push (fs|gs):[offset]"
replaces the "push rbp", you place the new frame pointer in the TLS
and at exit you reverse the push. To make the stack unwinder support
this we just need a thread-local storage convention which (imo) is way
less painful than say supporting page faults, to map in debug
information, during stack unwinding.

- when function call overhead is significant it is evidence that your
inlining is broken. With FDO, AutoFDO, LTO, etc. there are many tools
to tackle this. A constant problem is that the x86-64 ABI lacks
callee-save registers meaning that if you do use the registers at your
disposal you can end up with a function call looking like a context
switch in terms of spilling out and then needing to refill registers.
Setting up FDO and LTO is necessary to eliminate not just function
call overhead from establishing a frame but also the more significant
costs due to a lack of callee-save registers. APX is adding fast
push/pop instructions to try to alleviate this, but we're overdue for
a new x86-64 ABI.

Making jitdump better is good, but I'm not motivated by frame pointers
being a problem. Fixing jitdump won't fix BPF, etc.

Thanks,
Ian


* Re: Perf support in CPython
  2023-11-22 21:04 ` Ian Rogers
@ 2023-11-30  6:39   ` Namhyung Kim
  0 siblings, 0 replies; 11+ messages in thread
From: Namhyung Kim @ 2023-11-30  6:39 UTC (permalink / raw)
  To: Ian Rogers; +Cc: Pablo Galindo Salgado, linux-perf-users, Andrii Nakryiko

On Wed, Nov 22, 2023 at 1:05 PM Ian Rogers <irogers@google.com> wrote:
>
> On Mon, Nov 20, 2023 at 3:50 PM Pablo Galindo Salgado
> <pablogsal@gmail.com> wrote:
> >
> > Hi,
> >
> > I am Pablo Galindo Salgado from the CPython core development team. In
> > the latest release of CPython (Python 3.12) I have added support for
> > including Python function calls in the output of perf by leveraging
> > the jit-interface support (writing to /tmp/perf-%d.map). The support
> > works by compiling assembly trampolines at runtime that just call the
> > Python bytecode evaluation loop and assigning function names to the
> > trampolines by writing to /tmp/perf-%d.map. This has worked really
> > well when CPython is compiled with frame pointers, but unfortunately
> > CPython also suffers a 5% to 10% slowdown when frame pointers are
> > used. Moreover, perf currently cannot unwind through the jitted
> > trampolines without frame pointers.
>
> There was an analysis of the CPython frame pointer regression here:
> https://pagure.io/fesco/issue/2817#comment-826636
> but with GCC some of it was stemming from sloppy code generation:
> always using [EBP + ..] addressing rather than a shorter [ESP + ..]
> form where one is available.

So I went through the thread and felt like compiler code generation
with frame pointers could be improved.  But changing compilers would
take a long time, so I think we need a workaround in the meantime.

>
> I was ranting at LPC last week about people not wanting frame pointers
> so I'll try to write a blog, but some pertinent points:
>
> - we're only talking x86 here and we should try to avoid overfitting
> to one particular architecture.
>
> - x86 with APX extensions (ETA 2025) is getting 16 new registers.
>
> - frame pointers don't need to cost a register: you can put the frame
> pointer in thread-local storage and then "push (fs|gs):[offset]"
> replaces the "push rbp", you place the new frame pointer in the TLS
> and at exit you reverse the push. To make the stack unwinder support
> this we just need a thread-local storage convention which (imo) is way
> less painful than say supporting page faults, to map in debug
> information, during stack unwinding.
>
> - when function call overhead is significant it is evidence that your
> inlining is broken. With FDO, AutoFDO, LTO, etc. there are many tools
> to tackle this. A constant problem is that the x86-64 ABI lacks
> callee-save registers meaning that if you do use the registers at your
> disposal you can end up with a function call looking like a context
> switch in terms of spilling out and then needing to refill registers.
> Setting up FDO and LTO is necessary to eliminate not just function
> call overhead from establishing a frame but also the more significant
> costs due to a lack of callee-save registers. APX is adding fast
> push/pop instructions to try to alleviate this, but we're overdue for
> a new x86-64 ABI.
>
> Making jitdump better is good, but I'm not motivated by frame pointers
> being a problem. Fixing jitdump won't fix BPF, etc.

I'd be happy if new hardware or compilers would bring the frame
pointers back.  But I'm also open to finding alternatives like shadow
stacks, SFrame, or jitdump before that.

Thanks,
Namhyung


* Re: Perf support in CPython
  2023-11-20 23:50 Perf support in CPython Pablo Galindo Salgado
  2023-11-21 18:15 ` Namhyung Kim
  2023-11-22 21:04 ` Ian Rogers
@ 2023-11-30 11:00 ` James Clark
  2023-11-30 12:42   ` Pablo Galindo Salgado
  2 siblings, 1 reply; 11+ messages in thread
From: James Clark @ 2023-11-30 11:00 UTC (permalink / raw)
  To: Pablo Galindo Salgado, linux-perf-users



On 20/11/2023 23:50, Pablo Galindo Salgado wrote:
> Hi,
> 
> I am Pablo Galindo Salgado from the CPython core development team. In
> the latest release of CPython (Python 3.12) I have added support for
> including Python function calls in the output of perf by leveraging
> the jit-interface support (writing to /tmp/perf-%d.map). The support
> works by compiling assembly trampolines at runtime that just call the
> Python bytecode evaluation loop and assigning function names to the
> trampolines by writing to /tmp/perf-%d.map. This has worked really
> well when CPython is compiled with frame pointers, but unfortunately
> CPython also suffers a 5% to 10% slowdown when frame pointers are
> used. Moreover, perf currently cannot unwind through the jitted
> trampolines without frame pointers.
> 

I doubt this will have any impact on the 5% - 10%, but I'll leave it
here anyway just in case:

(At least on Arm) you can avoid "-mno-omit-leaf-frame-pointer". As in,
leaf frame pointers can be omitted. This is because of this change that
we added to Perf:

  void arch__add_leaf_frame_record_opts(struct record_opts *opts)
  {
     opts->sample_user_regs |= sample_reg_masks[PERF_REG_ARM64_LR].mask;
  }

After sampling the register, even in FP unwind mode, we'll still try to
do a single step of Dwarf unwind of the last frame, using only the link
register as the input data, and none of the stack space. The unwinder
knows at that instruction whether the link register contains the return
address, and there is a high chance that it does.
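
If it helps to see it from the syscall side, this is roughly the
request involved: sampling user registers with only the LR bit set in
sample_regs_user. The snippet below is an illustrative arm64-only
sketch (the event choice and sample period are arbitrary), not the
actual perf tool code:

  #include <linux/perf_event.h>
  #include <asm/perf_regs.h>        /* PERF_REG_ARM64_LR (arm64 only) */
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Ask the kernel to sample the user-space link register alongside
   * the frame-pointer callchain, as perf record does on arm64. */
  static int open_cycles_event_with_lr(pid_t pid)
  {
      struct perf_event_attr attr;

      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = PERF_TYPE_HARDWARE;
      attr.config = PERF_COUNT_HW_CPU_CYCLES;
      attr.sample_period = 100003;
      attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN |
                         PERF_SAMPLE_REGS_USER;
      attr.sample_regs_user = 1ULL << PERF_REG_ARM64_LR;

      return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
  }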

We did this because at least with some compilers,
"-fno-omit-frame-pointer" actually has no effect on leaf frames: the
frame pointer will still be omitted even if you asked for it not to be
(because frame pointers don't do anything in leaf frames). Although
maybe "-mno-omit-leaf-frame-pointer" fixes this, you actually don't
need it as long as you sampled the link register and have the Dwarf
available.

So I suppose to get that working you'd have to add the
JIT_CODE_UNWINDING_INFO stuff.

> I am looking to extend support to the jitdump specification to allow
> perf to unwind through these trampolines when using dwarf unwinding
> mode.

Do you really want to use full Dwarf mode? You have to save the entire
stack space on every sample for that to work, so it seems a bit slow.
But maybe it's not an issue for lower sampling frequencies. If the
application has a huge stack it's not really scalable.

Is that just so you have a mode that works out of the box without any
recompilation of Python? But not necessarily the best way to do it?

I don't know if it's possible to have some kind of hybrid approach where
only the trampolines have leaf frame pointers, but none of the other
code does, so you don't get the overhead issue? Or would that also fall
apart in the kernel's FP unwinder?

> I have a draft patch
> (https://github.com/python/cpython/pull/112254) that adds support for
> the specification by writing JIT_CODE_LOAD and JIT_CODE_UNWINDING_INFO
> records (currently with empty EH frame entries to force perf to defer
> to FP-based unwinding). This works well on my current distribution
> (Arch Linux with perf version 6.5) but falls apart when tested on
> other distributions.
> 
> * One problem I have struggled to debug is that even when compiled
> with frame pointers and without emitting JIT_CODE_UNWINDING_INFO
> (using fp-based unwinding), in some cases (such as the latest Ubuntu
> versions) the shared objects and addresses generated by perf inject
> -j look sane and the addresses match, but perf report and perf script
> do not include the symbol names in their output. I do not understand
> why this is happening, and my debugging so far has not given me any
> thread to pull, so I wanted to reach out to this mailing list for
> assistance.
> 
> * The other problem I am finding is including eh_frames for our
> trampolines, as it looks like deferring to FP unwinding is unreliable
> even if our trampolines are trivial. This is a different problem, but
> my current attempts to add EH frames have failed and I don't know how
> to debug them properly, as it looks like perf doesn't expect some of
> the values we are writing. I would love it if someone with perf and
> DWARF expertise could help steer us in the CPython team in the right
> direction.
> 
> Thanks a lot in advance,
> Pablo Galindo Salgado
> 


* Re: Perf support in CPython
  2023-11-30 11:00 ` James Clark
@ 2023-11-30 12:42   ` Pablo Galindo Salgado
  2023-11-30 18:09     ` James Clark
  2023-11-30 21:16     ` Ian Rogers
  0 siblings, 2 replies; 11+ messages in thread
From: Pablo Galindo Salgado @ 2023-11-30 12:42 UTC (permalink / raw)
  To: James Clark; +Cc: linux-perf-users

> Do you really want to use full Dwarf mode? You have to save the entire
> stack space on every sample for that to work, so it seems a bit slow.
> But maybe it's not an issue for lower sampling frequencies. If the
> application has a huge stack it's not really scalable.
> Is that just so you have a mode that works out of the box without any
> recompilation of Python? But not necessarily the best way to do it?

I do not *want* to do it, as we already have a fully working version
using the perf map files when frame pointers are included. But the
problem is that users cannot generally leverage this, as most Python
redistributors do not compile with frame pointers, and this renders
the integration useless for most people.

So indeed, as you mention, dwarf unwinding is suboptimal, but it will
provide a way for most users to get the integration working, and
people who really care about the most performant way can compile
Python with frame pointers. The problem is that most Python users do
not compile Python themselves, and this is a huge barrier for them.

That's why we want to *also* have DWARF unwinding working, even if it
is suboptimal.


* Re: Perf support in CPython
  2023-11-30 12:42   ` Pablo Galindo Salgado
@ 2023-11-30 18:09     ` James Clark
  2023-11-30 21:16     ` Ian Rogers
  1 sibling, 0 replies; 11+ messages in thread
From: James Clark @ 2023-11-30 18:09 UTC (permalink / raw)
  To: Pablo Galindo Salgado; +Cc: linux-perf-users



On 30/11/2023 12:42, Pablo Galindo Salgado wrote:
>> Do you really want to use full Dwarf mode? You have to save the entire
>> stack space on every sample for that to work, so it seems a bit slow.
>> But maybe it's not an issue for lower sampling frequencies. If the
>> application has a huge stack it's not really scalable.
>> Is that just so you have a mode that works out of the box without any
>> recompilation of Python? But not necessarily the best way to do it?
> 
> I do not *want* to do it, as we already have a fully working version
> using the perf map files when frame pointers are included. But the
> problem is that users cannot generally leverage this, as most Python
> redistributors do not compile with frame pointers, and this renders
> the integration useless for most people.
> 
> So indeed, as you mention, dwarf unwinding is suboptimal, but it will
> provide a way for most users to get the integration working, and
> people who really care about the most performant way can compile
> Python with frame pointers. The problem is that most Python users do
> not compile Python themselves, and this is a huge barrier for them.
> 
> That's why we want to *also* have DWARF unwinding working, even if it
> is suboptimal.

Makes sense, it just makes me wonder if there isn't something that's low
overhead that could be added to the default build configuration of
Python that makes it work even in frame pointer mode. Like putting frame
pointers only in the trampolines.

If that doesn't work, maybe this is kind of a ridiculous idea, but what
if you compiled the final binary so that it had two versions of Python,
one with frame pointers and one without, and when you run "python -X
perf" it goes down the path with frame pointers on. Presumably the
performance hit isn't as big of a deal if it's only on for profiling.

I suppose everything is a tradeoff, and that trades fewer build
configurations for a larger binary size and more complicated build system.

Feel free to ignore me though, I'm just thinking out loud. But if Python
was already modified to insert the trampolines, it seems like there must
be some modification that can be done to make frame pointer unwinding
work without turning them on for every single performance critical function.


* Re: Perf support in CPython
  2023-11-30 12:42   ` Pablo Galindo Salgado
  2023-11-30 18:09     ` James Clark
@ 2023-11-30 21:16     ` Ian Rogers
  2023-12-01  9:57       ` James Clark
  1 sibling, 1 reply; 11+ messages in thread
From: Ian Rogers @ 2023-11-30 21:16 UTC (permalink / raw)
  To: Pablo Galindo Salgado; +Cc: James Clark, linux-perf-users

On Thu, Nov 30, 2023 at 4:42 AM Pablo Galindo Salgado
<pablogsal@gmail.com> wrote:
>
> > Do you really want to use full Dwarf mode? You have to save the entire
> > stack space on every sample for that to work, so it seems a bit slow.
> > But maybe it's not an issue for lower sampling frequencies. If the
> > application has a huge stack it's not really scalable.
> > Is that just so you have a mode that works out of the box without any
> > recompilation of Python? But not necessarily the best way to do it?
>
> I do not *want* to do it, as we already have a fully working version
> using the perf map files when frame pointers are included. But the
> problem is that users cannot generally leverage this, as most Python
> redistributors do not compile with frame pointers, and this renders
> the integration useless for most people.
>
> So indeed, as you mention, dwarf unwinding is suboptimal, but it will
> provide a way for most users to get the integration working, and
> people who really care about the most performant way can compile
> Python with frame pointers. The problem is that most Python users do
> not compile Python themselves, and this is a huge barrier for them.
>
> That's why we want to *also* have DWARF unwinding working, even if it
> is suboptimal.

Won't fixing GCC/LLVM be the best approach? Change the RBP offsets to
RSP relative ones if shorter, perhaps around this code:
https://github.com/gcc-mirror/gcc/blob/master/gcc/emit-rtl.cc#L882
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86FrameLowering.cpp#L2547

Nobody really expects 1 register to be worth 10% performance, so
fixing a performance elephant (which by definition should be easy to
see) in the compiler will win for everybody and give a solution that
works with BPF, etc.

Thanks,
Ian


* Re: Perf support in CPython
  2023-11-30 21:16     ` Ian Rogers
@ 2023-12-01  9:57       ` James Clark
  0 siblings, 0 replies; 11+ messages in thread
From: James Clark @ 2023-12-01  9:57 UTC (permalink / raw)
  To: Ian Rogers, Pablo Galindo Salgado; +Cc: linux-perf-users



On 30/11/2023 21:16, Ian Rogers wrote:
> On Thu, Nov 30, 2023 at 4:42 AM Pablo Galindo Salgado
> <pablogsal@gmail.com> wrote:
>>
>>> Do you really want to use full Dwarf mode? You have to save the entire
>>> stack space on every sample for that to work, so it seems a bit slow.
>>> But maybe it's not an issue for lower sampling frequencies. If the
>>> application has a huge stack it's not really scalable.
>>> Is that just so you have a mode that works out of the box without any
>>> recompilation of Python? But not necessarily the best way to do it?
>>
>> I do not *want* to do it, as we already have a fully working version
>> using the perf map files when frame pointers are included. But the
>> problem is that users cannot generally leverage this, as most Python
>> redistributors do not compile with frame pointers, and this renders
>> the integration useless for most people.
>>
>> So indeed, as you mention, dwarf unwinding is suboptimal, but it will
>> provide a way for most users to get the integration working, and
>> people who really care about the most performant way can compile
>> Python with frame pointers. The problem is that most Python users do
>> not compile Python themselves, and this is a huge barrier for them.
>>
>> That's why we want to *also* have DWARF unwinding working, even if it
>> is suboptimal.
> 
> Won't fixing GCC/LLVM be the best approach? Change the RBP offsets to
> RSP relative ones if shorter, perhaps around this code:
> https://github.com/gcc-mirror/gcc/blob/master/gcc/emit-rtl.cc#L882
> https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86FrameLowering.cpp#L2547
> 
> Nobody really expects 1 register to be worth 10% performance, so
> fixing a performance elephant (which by definition should be easy to
> see) in the compiler will win for everybody and give a solution that
> works with BPF, etc.
> 
> Thanks,
> Ian

I found this thread discussing why Python has such a big performance
issue with frame pointers relevant, so I'll leave it here:
https://pagure.io/fesco/issue/2817#comment-826636


end of thread

Thread overview: 11+ messages
2023-11-20 23:50 Perf support in CPython Pablo Galindo Salgado
2023-11-21 18:15 ` Namhyung Kim
2023-11-21 18:30   ` Pablo Galindo Salgado
2023-11-21 18:49     ` Namhyung Kim
2023-11-22 21:04 ` Ian Rogers
2023-11-30  6:39   ` Namhyung Kim
2023-11-30 11:00 ` James Clark
2023-11-30 12:42   ` Pablo Galindo Salgado
2023-11-30 18:09     ` James Clark
2023-11-30 21:16     ` Ian Rogers
2023-12-01  9:57       ` James Clark
