From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 16 Jan 2025 10:55:08 -0800
From: Namhyung Kim
To: Dmitry Vyukov
Cc: Ian Rogers, linux-perf-users@vger.kernel.org,
	linux-kernel@vger.kernel.org, eranian@google.com, Ingo Molnar,
	Peter Zijlstra, chu howard
Subject: Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
References: <20250113134022.2545894-1-dvyukov@google.com>
X-Mailing-List: linux-perf-users@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

On Wed, Jan 15, 2025 at 08:11:51AM +0100, Dmitry Vyukov wrote:
> On Wed, 15 Jan 2025 at 06:59, Ian Rogers wrote:
> >
> > On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov wrote:
> > [snip]
> > > FWIW I've also considered and started implementing a different
> > > approach where the kernel would count parallelism level for each
> > > context and write it out with samples:
> > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > > Not sure how hard it is to make all corner cases work there, I dropped
> > > it half way b/c the perf record post-processing looked like a better
> > > approach.
> >
> > Nice. Just to focus on this point and go off on something of a
> > tangent. I worry a little about perf_event_sample_format where we've
> > used 25 out of the 64 bits of sample_type. Perhaps there will be a
> > sample_type2 in the future. For the code and data page size it seems
> > the same information could come from mmap events. You have a similar
> > issue. I was thinking of another similar issue, adding information
> > about the number of dirty pages in a VMA.
> > I wonder if there is a
> > better way to organize these things, rather than just keep using up
> > bits in the perf_event_sample_format. For example, we could have a
> > code page size software event that when in a leader sampling group
> > with a hardware event with a sample IP provides the code page size
> > information of the leader event's sample IP. We have loads of space in
> > the types and config values to have an endless number of such events
> > and maybe the value could be generated by a BPF program for yet more
> > flexibility. What these events would mean without a leader sample
> > event I'm not sure.
>
> In the end I did not go with adding parallelism to each sample (this
> is purely a perf report change), so at least for this patch this is very
> tangential :)
>
> > Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> > sampling. Wall clock time being off CPU plus on CPU. We need to do
> > something to move forward the default flags/options for perf record,
> > for example, we don't enable build ID mmap events by default causing
> > the whole perf.data file to be scanned looking to add build ID events
> > for the dsos with samples in them. One option that could be a default
> > could be off-CPU profiling, and when permissions deny the BPF approach
> > we can fall back on using events. If these events are there by default
> > then it makes sense to hook them up in perf report.
>
> Interesting. Do you mean "IO" by "off-CPU"?
> Yes, if a program was blocked for IO for 10 seconds (no CPU work),
> then that obviously contributes to latency, but won't be in this
> profile. Though, it still works well for a large number of important
> cases (e.g. builds, ML inference, server request handling are
> frequently not IO bound).
>
> I was thinking how IO can be accounted for in the wall-clock profile.
> Since we have SWITCH OUT events (and they already include the preemption
> bit), we do have info to account for blocked threads.
> But it gets
> somewhat complex and has to make some hypotheses b/c not all blocked
> threads contribute to latency (e.g. a blocked watchdog thread). So I
> left it out for now.
>
> The idea was as follows.
> We know the set of threads blocked at any point in time (switched out,
> but not preempted).
> We hypothesise that each of these could equally improve CPU load to
> max if/when unblocked.
> We inject synthetic samples in a leaf "IO wait" symbol with the weight
> according to the hypothesis. I think these events can be injected
> before each switch-in event only (which changes the set of blocked
> threads).
>
> Namely:
> If CPU load is already at max (parallelism == num cpus), we don't
> inject any IO wait events.
> If the number of blocked threads is 0, we don't inject any IO wait events.
> If there are C idle CPUs and B blocked threads, we inject IO wait
> events with weight C/B for each of them.

To track idle CPUs, you need sched-switch of all CPUs regardless of your
workload, right?

Also I'm not sure when you want to inject the IO wait events - when a
thread is sched-out without preemption? And what would be the weight?
I guess you want something like:

  blocked time * C / B

Then C and B can change before the thread is woken up.

> For example, if there is a single blocked thread, then we hypothesise
> that this blocked thread is the root cause of all currently idle CPUs
> being idle.

I think this may make sense when you target a single process group but
it also needs system-wide idle information.

> This still has a problem of saying that unrelated threads contribute
> to latency, but at least it's a simple/explainable model and it should
> show guilty threads as well. Maybe unrelated threads can be filtered
> by the user by specifying a set of symbols in stacks of unrelated
> threads. Or by task name.
>
> Does it make any sense? Do you have anything better?

I'm not sure if it's right to use idle state which will be affected by
unrelated processes.
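For concreteness, the C/B weighting rules quoted above could be sketched
roughly as below. This is a hypothetical post-processing helper, not
actual perf code; the function name and the thread-id inputs are
illustrative only:

```python
def io_wait_weights(num_cpus, parallelism, blocked_tids):
    """Weight of synthetic 'IO wait' samples per the proposed model.

    If C CPUs are idle and B threads are blocked (switched out but not
    preempted), each blocked thread gets a sample of weight C/B.
    Returns a dict mapping thread id -> weight; empty when nothing
    should be injected.
    """
    idle_cpus = num_cpus - parallelism
    b = len(blocked_tids)
    # No injection when CPUs are saturated or no thread is blocked.
    if idle_cpus <= 0 or b == 0:
        return {}
    weight = idle_cpus / b
    return {tid: weight for tid in blocked_tids}
```

With 8 CPUs, parallelism 6, and two blocked threads, each blocked
thread would be charged with weight (8 - 6) / 2 = 1.0.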
Maybe it's good for system-wide profiling. For a process (group)
profiling, I think you need to consider the number of total threads,
active threads, and CPUs. And if the #active-threads is less than
min(#total-threads, #CPUs), then it could be considered as idle from
the workload's perspective.

What do you think?

Thanks,
Namhyung
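(For reference, the per-workload idle criterion suggested above can be
sketched in a few lines. Again a hypothetical helper with illustrative
names, not perf code: the workload can usefully run at most
min(#total-threads, #CPUs) threads, and anything below that counts as
idle regardless of what unrelated processes do with the CPUs.)

```python
def workload_idle_slots(total_threads, active_threads, num_cpus):
    """Idle slots from the workload's own perspective.

    The workload's achievable parallelism is capped at
    min(total_threads, num_cpus); the shortfall of active threads
    against that cap is counted as idle, independent of system-wide
    CPU idle state.
    """
    runnable_cap = min(total_threads, num_cpus)
    return max(0, runnable_cap - active_threads)
```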