From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out162-62-58-216.mail.qq.com (out162-62-58-216.mail.qq.com [162.62.58.216]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8FED21D6187 for ; Thu, 19 Jun 2025 14:18:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=162.62.58.216 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750342725; cv=none; b=LaZoFM5fUEoxuHfPgtTxTcFQKgE6632SBrBMX0JfL+aSeo+OBS5nDnsjm20GdbZsud7LMHnxFNfLtWfXnTn6+OQ20hzB5avEsG9Hk35URTDfaqdxClm63z0t1LcEcafTqXPQHdsOGbRVjrMeQlwlisCvGXSCY+PHE1apAzxx3Mo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750342725; c=relaxed/simple; bh=H6wzggDRhxLgF/k7QgwWzPPiwJmYOU2J3mhjqhSXHk0=; h=Message-ID:Content-Type:Mime-Version:Subject:From:In-Reply-To: Date:Cc:References:To; b=URw3QRCVudVOSkpL+wrXpgkBfvAwgXQpqfwNNrLQU8vfJsVQWD363Ed0X8SmVxfLiCHki7IrwaHmaped1IuYbbNUIPwQj4qEoJRoVnnsnxkxj9Bs/NiyWID3zMZvfpjqrVWeyCGL0ALIbPDx0vEgyDgb3zp4F84/0CnLchvx5PA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=cyyself.name; spf=pass smtp.mailfrom=cyyself.name; dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b=OOpYVzch; arc=none smtp.client-ip=162.62.58.216 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=cyyself.name Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=cyyself.name Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=qq.com header.i=@qq.com header.b="OOpYVzch" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201512; t=1750342704; bh=jEPNTd2WPyoDSyTpEMpP/zxHOT0uryc9RD50UhZr78A=; h=Subject:From:In-Reply-To:Date:Cc:References:To; b=OOpYVzchjatZ/O42Rtt15I/wzUYV2xm35DqhbM0nL/Q5aNbB676VqIQjGTvfkNK0Y VADggGJM9pOxtTTvkB7dIVg81kp4S7AnZQSbh9T9MeOjrzDhEqgqLc5N1NK3JSCTNM 4a5eaprowaJvaOgIa+44jAySdfewgguF3QQZ/dsA= Received: from smtpclient.apple ([2408:8207:18a2:f420:88c5:a3d0:e224:94b2]) by newxmesmtplogicsvrszb16-1.qq.com (NewEsmtp) with SMTP id 30E27000; Thu, 19 Jun 2025 22:12:14 +0800 X-QQ-mid: xmsmtpt1750342334t1qpkrhkq Message-ID: X-QQ-XMAILINFO: NQR8mRxMnur9tZ/1rpJ6YURlDIRkqdSvQg6OqWDrVF/QvBWs51cGg0ant9754D mfznjollGXuxngwjuCxHkY19DV3Tl/S1vWXh5KI+Sw7OT2F5M/T55boe0d8qQrGXKNaGxP9HFlav QpC72Wx2QuZWjerGpykqJOKoV/gQISXf0Vle1EwpBziKrZE07RQQhNgYkZTDd94W90jm2ErV2kyH XHG82e1hFnkY1mnXWJOSzf6RO3kfTysuv8gHrIUPklnHR0AyACRmS/22vvL4BgJevC5Uuv999bLJ 0C4wNNQfhd6AOl76BReXNvzhZa0lEbvHDNXeSGl4Y839oWBYSA6qLaqhA217SrrH/bR8ajsv0Cfo 3ltr9lo2cg87GP5PQr4b7lCtTtbNjQqd5gS6sIgC+xophVxNDdAY3P6KsK4ulvxQsd4akjm8/XBU oQfCcDaMu64dN5D3QzcegZerXmJGkU7iDlFR1LWhQOkEBUZufR3WdXKLl6/rLTF8Rm6XS/4B/cqv On3Vf975oRFlxuII/vWlr/2JVKvWfBf1DQ/HLqN7YPqrBrN1m58qUEKgWOGvWawK5AkckJUvpxJ0 w1m/Shq4PaKqIvH1qLK4aU8S4ZKwqxgNoqoQ6jA07E9TzJBIbawDl5ePHvftMx/8zWBvfF3frKhl TTsn85QVwSuntUBK0zOwsPC0znbE+tCjioEo82t4hQl9V2rLkv0bdw2UzsxzVB9oAFhXTeogXGNJ rfbmyCsiuYUhVNEJySMAlf8e9bDCEDuHUepP+PD6Vt6lc+/zbqu1fV6dm5oWVq11/ckuWhSfcPVY ryXZlOcngbtrzKDHAqDF+yTit9fPIcfI2BGZFXMQTHt79VvDX3I/25Rn3iX0rROGlmQXcfBI4+b2 f3OAbrjaSw/wPzlT32AhFdJHCW1+mcGL3nf86vcTE7q6ffdhTktEP/LBBQmKlbt/N6kzBmurUb9l X4Xq1/u2K709dIixd3+QIAiQ+WF/6XNlnr/maF1V8Ks7KxxfJCfPv664MKRUbPp5e/Q+D858b5Cc /0CkhtnR76AD4Zs8YrV2TYat+WiqA= X-QQ-XMRINFO: Nq+8W0+stu50PRdwbJxPCL0= Content-Type: text/plain; charset=us-ascii Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.600.51.1.1\)) Subject: Re: [RFC patch v3 00/20] Cache aware scheduling From: Yangyu Chen In-Reply-To: <9dbdfec2-0a2d-4110-b8ef-b9d16900ff87@intel.com> Date: Thu, 19 Jun 2025 22:12:03 +0800 Cc: Tim Chen , Peter Zijlstra , Ingo Molnar , K Prateek Nayak , "Gautham R . Shenoy" , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Tim Chen , Vincent Guittot , Libo Chen , Abel Wu , Madadi Vineeth Reddy , Hillf Danton , Len Brown , linux-kernel@vger.kernel.org Content-Transfer-Encoding: 7bit X-OQ-MSGID: <879A5558-B722-4FC4-880E-C0DA303912D4@cyyself.name> References: <9dbdfec2-0a2d-4110-b8ef-b9d16900ff87@intel.com> To: "Chen, Yu C" X-Mailer: Apple Mail (2.3826.600.51.1.1) > On 19 Jun 2025, at 21:21, Chen, Yu C wrote: > > On 6/19/2025 2:39 PM, Yangyu Chen wrote: >> Nice work! >> I've tested your patch based on commit fb4d33ab452e and found it >> incredibly helpful for Verilator with large RTL simulations like >> XiangShan [1] on AMD EPYC Geona. >> I've created a simple benchmark [2] using a static build of an >> 8-thread Verilator of XiangShan. Simply clone the repository and >> run `make run`. >> In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC >> 9T24, before the patch, we have a simulation time of 49.348ms. This >> was because each thread was distributed across every CCX, resulting >> in extremely high core-to-core latency. However, after applying the >> patch, the entire 8-thread Verilator is allocated to a single CCX. >> Consequently, the simulation time was reduced to 24.196ms, which >> is a remarkable 2.03x faster than before. We don't need numactl >> anymore! >> [1] https://github.com/OpenXiangShan/XiangShan >> [2] https://github.com/cyyself/chacha20-xiangshan >> Tested-by: Yangyu Chen > > Thanks Yangyu for your test. May I know if these 8-threads have any > data sharing with each other, or each thread has their dedicated > data? Or, there is 1 main thread, the other 7 threads do the > chacha20 rotate and put the result to the main thread? Ah, I had forgotten to mention the benchmark. The workload is not about chacha20 itself. This benchmark utilizes a RTL-level simulator [1] that runs an Open Source OoO CPU core called XiangShan [2]. The chacha20 algorithm is executed on the guest CPU within this simulator. The verilator partitions a large RTL design into multiple blocks of functions and distributes them to each thread. These signals require synchronization every guest cycle, and synchronization is also necessary when a dependency exists. Given that we have approximately 5K guest cycles per second, there is a significant amount of data that needs to be transferred between each thread. If there are signal dependencies, this could lead to latency-bound performance. [1] https://github.com/verilator/verilator [2] https://github.com/OpenXiangShan/XiangShan Thanks, Yangyu Chen > Anyway I tested it on a Xeon EMR with turbo-disabled and saw ~20% > reduction in the total time. Nice result! > > Thanks, > Chenyu