From: James Clark
To: Pablo Galindo Salgado, linux-perf-users@vger.kernel.org
Subject: Re: Perf support in CPython
Date: Thu, 30 Nov 2023 11:00:38 +0000
Message-ID: <896b5786-8ed7-af6c-2c64-a24bb06a0d89@arm.com>

On 20/11/2023 23:50, Pablo Galindo Salgado wrote:
> Hi,
>
> I am Pablo Galindo Salgado from the CPython core development team. In
> the past release of CPython (Python 3.12) I have added support
> for including Python function calls in the output of perf by
> leveraging the jit-interface support by writing to /tmp/perf-%d.map.
> The support works by compiling assembly trampolines at runtime that
> just call the Python bytecode evaluation loop and assigning function
> names to the trampolines by writing in /tmp/perf-%d.map. This has
> worked really well when CPython is compiled with frame pointers but
> unfortunately CPython also suffers from 5% to 10% slow down when frame
> pointers are used. Unfortunately, perf cannot unwind through the
> jitted trampolines without frame pointers currently.
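For anyone following along who hasn't used the map file interface: each line of /tmp/perf-<pid>.map is "START SIZE symbolname", with START and SIZE as hex numbers without a 0x prefix. Below is a minimal Python sketch of that format only. The helper names and the "py::name:file" symbol are made up for illustration, and this is not how CPython itself emits the entries.

```python
import os


def format_perf_map_entry(start, size, name):
    """Return one perf map line: hex start, hex size, then the symbol name."""
    return f"{start:x} {size:x} {name}\n"


def append_perf_map_entry(start, size, name, pid=None, directory="/tmp"):
    """Append one entry to <directory>/perf-<pid>.map and return the path.

    Hypothetical helper: perf looks the file up by the pid of the profiled
    process, so pid defaults to the current process.
    """
    pid = os.getpid() if pid is None else pid
    path = os.path.join(directory, f"perf-{pid}.map")
    with open(path, "a", encoding="utf-8") as f:
        f.write(format_perf_map_entry(start, size, name))
    return path


if __name__ == "__main__":
    # Illustrative values: a 64-byte trampoline at a made-up address.
    print(format_perf_map_entry(0x7F3A1C000000, 64, "py::foo:/tmp/example.py"), end="")
```

The map file only carries names for address ranges. It has no unwind information, which is why this route depends on frame pointers.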

I doubt this will have any impact on the 5% - 10%, but I'll leave it
here anyway just in case:

(At least on Arm) you can avoid "-mno-omit-leaf-frame-pointer". As in,
leaf frame pointers can be omitted. This is because of this change that
we added to perf:

  void arch__add_leaf_frame_record_opts(struct record_opts *opts)
  {
          opts->sample_user_regs |= sample_reg_masks[PERF_REG_ARM64_LR].mask;
  }

After sampling the register, even in FP unwind mode, we'll still try to
do a single step of DWARF unwind of the last frame, using only the link
register as the input data, and none of the stack space. The unwinder
knows at that instruction whether the link register contains the return
address, and there is a high chance that it does.

We did this because at least with some compilers,
"-fno-omit-frame-pointer" actually has no effect on leaf frames: the
compiler will still omit the frame pointer there even if you asked it
not to (because they don't do anything in leaf frames). Although maybe
"-mno-omit-leaf-frame-pointer" fixes this, you actually don't need it as
long as you sampled the link register and have the DWARF available.

So I suppose to get that working you'd have to add the
JIT_CODE_UNWINDING_INFO stuff.

> I am looking to extend support to the jitdump specification to allow
> perf to unwind through these trampolines when using dwarf unwinding
> mode.

Do you really want to use full DWARF mode? You have to save the entire
stack space on every sample for that to work, so it seems a bit slow.
But maybe it's not an issue for lower sampling frequencies. If the
application has a huge stack it's not really scalable.

Is that just so you have a mode that works out of the box without any
recompilation of Python? But not necessarily the best way to do it?

I don't know if it's possible to have some kind of hybrid approach where
only the trampolines have leaf frame pointers, but none of the other
code does, so you don't get the overhead issue? Or would that also fall
apart in the kernel's FP unwinder?
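
To put a rough number on the cost of full DWARF mode mentioned above, here is a back-of-the-envelope sketch in Python. It assumes the stack dump per sample is capped at 8192 bytes, which as far as I know is the default for "perf record --call-graph dwarf" and can be changed with "--call-graph dwarf,<size>". The function name is made up, and the result is only an upper bound for one busy thread, since a sample never copies more stack than is actually in use.

```python
# Assumption: 8192 bytes is the default stack dump size for
# `perf record --call-graph dwarf`; pass another value if you use
# `--call-graph dwarf,<size>`.
DEFAULT_DWARF_STACK_DUMP_BYTES = 8192


def dwarf_stack_bytes_per_second(sample_hz, dump_bytes=DEFAULT_DWARF_STACK_DUMP_BYTES):
    """Upper bound on user-stack bytes copied per second for one busy thread."""
    return sample_hz * dump_bytes


if __name__ == "__main__":
    for hz in (99, 999, 4000):
        mib = dwarf_stack_bytes_per_second(hz) / (1024 * 1024)
        print(f"{hz:>5} Hz -> up to {mib:.1f} MiB/s of stack data")
```

FP mode, by contrast, records only the list of return addresses per sample, which is where the difference in overhead comes from.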

> I have a draft for a patch
> (https://github.com/python/cpython/pull/112254) that adds support for
> the specification by writing JIT_CODE_LOAD and
> JIT_CODE_UNWINDING_INFO (currently with empty EH Frame entries to
> force perf to defer to FP based unwinding). This works well in my
> current distribution (Arch Linux with perf version 6.5) but when
> tested in other distributions it falls apart.
>
> * One problem I have struggled to debug is that even when compiled
>   with frame pointers and without emitting JIT_CODE_UNWINDING_INFO
>   (using fp-based unwinding), in some cases (such as the latest Ubuntu
>   versions), the shared objects and addresses generated by
>   perf-inject -j look sane, the addresses match but perf report or
>   perf script do not include the symbol names in their output. I do
>   not understand why this is happening and my debugging so far has not
>   allowed me to get any thread I can pull so I wanted to reach to this
>   mailing list for assistance.
>
> * The other problem I am finding is to include the eh_frames to our
>   trampolines as it looks like deferring to FP unwinding is unreliable
>   even if our trampolines are trivial. This is a different problem but
>   my current attempts to add EH Frames have failed and I don't know
>   how to properly debug them correctly as it looks like perf doesn't
>   expect some of the values we are writing. I would love it if someone
>   with perf and DWARF expertise could help us in the CPython team to
>   get steered in the right direction.
>
> Thanks a lot in advance,
> Pablo Galindo Salgado
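
In case it helps with comparing bytes when debugging the records above, here is a small Python sketch that packs a jitdump file header and one JIT_CODE_LOAD record. The layout is as I understand it from the jitdump specification in the perf tree, so treat the field order as an assumption to check against that document. The helper names, the symbol name and the one-byte code blob are made up, and JIT_CODE_UNWINDING_INFO is not covered.

```python
import struct

JITDUMP_MAGIC = 0x4A695444  # "JiTD"
JITDUMP_VERSION = 1
JIT_CODE_LOAD = 0  # record id
EM_X86_64 = 62  # ELF e_machine value; assumption: an x86-64 host

# "=" selects native byte order with standard sizes and no padding.
# magic, version, total_size, elf_mach, pad1, pid, timestamp, flags
FILE_HEADER = struct.Struct("=IIIIIIQQ")
# id, total_size, timestamp
RECORD_HEADER = struct.Struct("=IIQ")
# pid, tid, vma, code_addr, code_size, code_index
CODE_LOAD_FIXED = struct.Struct("=IIQQQQ")


def pack_file_header(pid, timestamp_ns, elf_mach=EM_X86_64):
    """Pack the 40-byte file header; total_size is the header's own size."""
    return FILE_HEADER.pack(
        JITDUMP_MAGIC, JITDUMP_VERSION, FILE_HEADER.size,
        elf_mach, 0, pid, timestamp_ns, 0,
    )


def pack_code_load(pid, tid, code_addr, code, name, code_index, timestamp_ns):
    """Pack one JIT_CODE_LOAD record: header, fixed fields, name, code bytes."""
    name_bytes = name.encode("utf-8") + b"\0"  # name is NUL-terminated
    total = RECORD_HEADER.size + CODE_LOAD_FIXED.size + len(name_bytes) + len(code)
    return b"".join([
        RECORD_HEADER.pack(JIT_CODE_LOAD, total, timestamp_ns),
        # vma and code_addr are set to the same address in this sketch.
        CODE_LOAD_FIXED.pack(pid, tid, code_addr, code_addr, len(code), code_index),
        name_bytes,
        code,
    ])


if __name__ == "__main__":
    record = pack_code_load(1234, 1234, 0x7F0000001000, b"\xc3",
                            "py::f:/tmp/example.py", 0, 1)
    print(len(pack_file_header(1234, 0)), len(record))
```

Dumping a record your writer produced next to one built like this makes it easier to spot a wrong total_size or a missing NUL terminator.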