Date: Sun, 29 Mar 2026 16:58:23 +0200
From: "Dr. Thomas Orgis"
To: Yiyang Chen
CC: Balbir Singh, Yang Yang, Wang Yaxin, Oleg Nesterov, Andrew Morton
Subject: Re: [PATCH] taskstats: retain dead thread stats in TGID queries
Message-ID: <20260329165823.1e26001d@plasteblaster>
In-Reply-To: <6f4ed79d96c389a9a1d67d5ced96c6326eda82ae.1774552296.git.cyyzero16@gmail.com>
References: <6f4ed79d96c389a9a1d67d5ced96c6326eda82ae.1774552296.git.cyyzero16@gmail.com>
Organization: Universität Hamburg
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 27 Mar 2026 03:12:07 +0800, Yiyang Chen wrote:

> However, fill_tgid_exit() only accumulates delay accounting into
> signal->stats. This means TGID queries lose the fields that
> fill_stats_for_tgid() adds for live threads once a thread exits,
> including ac_etime, ac_utime, ac_stime, nvcsw and nivcsw.

Seems like you are properly tackling the problem I, as an outsider, had
when I started out using the taskstats interface. Quoting my
task/process accounting tool sources (from around 2018):

 * I intended to only count processes (tgid stats), but that
 * gives empty values for the ones I am interested in. There was
 * a patch posted ages ago that would have added the accounting
 * fields in the aggregation ... but did not make it, apparently.
 * Linux kernel folks are interested in stuff like that delay
 * accounting (not sure I know what this is about), while I want
 * a reliable way to add up the compute/memory resources used
 * by certain processes.

Thanks! I was missing the fields interesting to me in tgid stats: (CPU)
times, I/O, memory.
I am not sure if I will get around to testing your patch quickly due to
personal time and brain-time constraints, but I want to express
interest in it.

I ended up adding the AGROUP flag in taskstats version 12 to solve my
issue, allowing my tool to tell the exit of an individual thread from
the exit of a group. This means, though, that I have to get all
individual thread stats from the kernel and later sort them into
aggregates per process. In the end I want to present something like
this to users (percentages on sums for the whole HPC batch job):

-------------------8<---------------
   cpu   mem    io │ maxrss  maxvm │  tasks procs │ command
     %     %     % │    GiB    GiB │              │
═══════════════════╪═══════════════╪══════════════╪════════
 100.0  99.9 100.0 │    2.6    3.5 │    576   192 │ some_program

Summary:
  Elapsed time: 38% (9.1 out of 24.0 h timelimit)
  CPU: 100% (191.7 out of 192 physical CPU cores)
  Max. main memory: 37% (273.9 out of 750.0 GiB min. available per node)
----------------->8-----------------

I can discern that this was a structurally simple (MPI) program that
spawned one process per CPU core and probably had two extra threads per
core for communication. It allocated 34 % more memory than it actually
needed. This one program took so much of the job's resources that other
processes don't really count. A bad HPC job has a long table of
commands each contributing a little, down towards individual calls to
'cat' and the like. I want to see and present those cases.
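
To illustrate the per-thread sorting described above, here is a minimal
Python sketch. It is not taken from my tool: the ThreadExit record, its
field names and both helper functions are hypothetical simplifications,
and it assumes that exactly one record per thread group carries the
group-exit marker (what AGROUP signals), so that counting marked
records yields the number of processes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ThreadExit:
    """One per-thread exit record (hypothetical, simplified layout)."""
    tgid: int         # thread group (process) ID
    comm: str         # command name
    cpu_us: int       # user + system CPU time in microseconds
    group_exit: bool  # assumed: set on the record that ends the group


def aggregate(records):
    """Sum per-thread exit records into per-command totals."""
    per_cmd = defaultdict(lambda: {"cpu_us": 0, "tasks": 0, "procs": 0})
    for rec in records:
        agg = per_cmd[rec.comm]
        agg["cpu_us"] += rec.cpu_us
        agg["tasks"] += 1          # every record is one finished task
        if rec.group_exit:
            agg["procs"] += 1      # one marked record per process
    return dict(per_cmd)


def cpu_shares(per_cmd):
    """Return each command's CPU time as a percentage of the job total."""
    total = sum(agg["cpu_us"] for agg in per_cmd.values())
    if total == 0:
        return {cmd: 0.0 for cmd in per_cmd}
    return {cmd: 100.0 * agg["cpu_us"] / total
            for cmd, agg in per_cmd.items()}
```

With group-level stats that already contain these sums, the loop over
individual thread records would shrink to one record per process.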
In another application, I collect statistics using accumulated CPU time
and coremem per program binary to be able to tell which programs and
(older) versions use how much of our cluster over the years.

With a counter for total tasks over the group lifetime added to struct
taskstats and the missing fields filled following your patch, I could
get all this information with a lot less overhead via datasets only on
tgid exit and would not have to count each task as it finishes. I
always like less overhead for monitoring/accounting!

> Factor the per-task TGID accumulation into a helper and use it in both
> fill_stats_for_tgid() and fill_tgid_exit(). This keeps the fields
> retained for dead threads aligned with the fields already accounted for
> live threads, and follows the existing taskstats TGID aggregation model,
> which already accumulates delay accounting in fill_tgid_exit() and
> combines it with a live-thread scan in fill_stats_for_tgid().

Pardon my ignorance, as I do not have the time right now to dive back
into kernel code: Should other fields of interest also be filled? Do we
have all of them covered? Memory highwater marks are not per-task,
right? But coremem, virtmem? I/O stats?

Also, in the end, I'd strongly prefer this patch to include a
user-visible change in the API, like an increased TASKSTATS_VERSION.
There are no new fields added, but the interpretation of the data is
different now for tgid.


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg