Date: Fri, 20 Mar 2026 11:18:47 +0100
From: Peter Zijlstra
To: Pan Deng
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, tianyou.li@intel.com, tim.c.chen@linux.intel.com, yu.c.chen@intel.com
Subject: Re: [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention
Message-ID: <20260320101847.GS3738786@noisy.programming.kicks-ass.net>
In-Reply-To: <346b697a0bbf9b0ff6a62d787ccf6665dcefc99f.1753076363.git.pan.deng@intel.com>
References: <346b697a0bbf9b0ff6a62d787ccf6665dcefc99f.1753076363.git.pan.deng@intel.com>
On Mon, Jul 21, 2025 at 02:10:24PM +0800, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> contention is observed in root_domain cachelines 1 and 3.

What's a HCC? Hobby Computer Club? Google is telling me it is the most
prevalent form of liver cancer, but I somehow doubt that is what you're on
about.

> The SUT is a 2-socket machine with 240 physical cores and 480 logical

Satellite User Terminal? Subsea Umbilical Termination? Small Unit
Transceiver? Single Unit Test?

> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.

Yes yes, poorly configured systems hurt.

> perf c2c tool reveals (sorted by contention severity):
>
> root_domain cache line 3:
> - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
>   since counts[0] is updated more frequently than the others whenever an
>   rt task enqueues onto an empty runq or dequeues from a non-overloaded
>   runq.
> - `rto_mask` (0x30) is heavily loaded
> - `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
> - `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
> - cycles per load: ~10K to 59K
>
> root_domain cache line 1:
> - `rto_count` (0x4) is frequently loaded/stored
> - `overloaded` (0x28) is heavily loaded
> - cycles per load: ~2.8K to 44K
>
> This change adjusts the layout of `root_domain` to isolate these contended
> fields across separate cache lines:
> 1. `rto_count` remains in the 1st cache line; `overloaded` and
>    `overutilized` are moved to the last cache line
> 2. `rto_push_work` is placed in the 2nd cache line
> 3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
>    cache line; `rto_mask` is moved near `pd` in the penultimate cache line
> 4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count`
>    contending with fields in cache line 3.
>
> With this change:
> - FPS improves by ~5%
> - Kernel cycles% drops from ~20% to ~17.7%
> - root_domain cache line 3 no longer appears in the perf-c2c report
> - cycles per load of root_domain cache line 1 is reduced from
>   ~2.8K-44K to ~2.1K-2.7K
> - stress-ng cyclic benchmark is improved ~18.6%, command:
>   stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
>     --timeout 30 --minimize --metrics
> - rt-tests/pi_stress is improved ~4.7%, command:
>   rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
>
> According to the nature of the change, to my understanding, it doesn't
> introduce any negative impact in other scenarios.
>
> Note: This change increases the size of `root_domain` from 29 to 31 cache
> lines; it's considered acceptable since `root_domain` is a single global
> object.

Uhm, what? We're at 207 cachelines due to that previous patch, remember?
A few more don't matter at this point, I would guess.

It doesn't actually apply anymore, but it needs the very same thing that
previous patch did -- more comments.
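
For anyone wanting to reproduce this kind of analysis: the numbers quoted
above are what perf c2c produces. Roughly (exact options vary with the perf
version), while the workload is running:

    perf c2c record -a -- sleep 30
    perf c2c report --stdio

which lists the hottest contended cachelines together with the offsets of
the loads and stores hitting them.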
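
And since the cover text doesn't spell out the trick: the whole change is
plain cacheline isolation of write-hot members. A compilable toy sketch of
the idea is below -- field names and sizes are made up, this is not the
actual root_domain layout; in kernel code the same effect is what
____cacheline_aligned gives you:

    /* Toy illustration only -- not the kernel's struct root_domain.
     * Force write-hot members onto their own cachelines so stores to
     * them stop invalidating the lines other CPUs mostly read.
     */
    #include <stdio.h>
    #include <stddef.h>

    #define CACHELINE 64

    struct toy_domain {
        int refcount;                 /* read-mostly */
        int rto_count;                /* frequently stored */

        /* push/IPI bookkeeping gets its own line */
        int rto_loop_start __attribute__((aligned(CACHELINE)));
        int rto_loop_next;

        /* heavily-loaded mask, kept away from the stores above */
        unsigned long rto_mask[8] __attribute__((aligned(CACHELINE)));

        /* per-priority counters start on a fresh line so updates to
         * the first counter do not contend with the fields above */
        int pri_count[101] __attribute__((aligned(CACHELINE)));
    };

    int main(void)
    {
        /* sanity check: the contended members land on distinct lines */
        printf("rto_count      @ %3zu\n", offsetof(struct toy_domain, rto_count));
        printf("rto_loop_start @ %3zu\n", offsetof(struct toy_domain, rto_loop_start));
        printf("rto_mask       @ %3zu\n", offsetof(struct toy_domain, rto_mask));
        printf("pri_count      @ %3zu\n", offsetof(struct toy_domain, pri_count));
        return 0;
    }

Running it shows each contended member starting on its own 64-byte
boundary, which is what keeps a store to one of them from bouncing the
line the others live on.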