From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
From: Tim Chen
To: K Prateek Nayak, "Chen, Yu C", Peter Zijlstra
Cc: Pan Deng, mingo@kernel.org, linux-kernel@vger.kernel.org, tianyou.li@intel.com
Date: Thu, 09 Apr 2026 16:09:55 -0700
References: <20260320124003.GU3738786@noisy.programming.kicks-ass.net>
 <63a095f02428700a7ff2623b8ea81e524a406834.camel@linux.intel.com>
 <20260324120008.GB3738010@noisy.programming.kicks-ass.net>
 <138c3f9d-309f-41e6-aa72-a3f6bd713bf0@intel.com>
 <22072ef8-5aec-49ac-9cc4-8a80bec14261@amd.com>
 <64649c85-29ab-4f70-a0c4-3c83cbdae2fc@intel.com>
 <20260402105530.GA3738786@noisy.programming.kicks-ass.net>
 <93d7eb33-c3a5-4498-bc26-57806b73d9e0@amd.com>
 <3b66e8e8-07e0-4f3e-a3ba-d97133af5162@intel.com>
 <1c742a1d8ecd8e314d704d46a44e2b8893479e50.camel@linux.intel.com>
 <2881a07f-ff14-4faa-9da7-3fbe206a463d@amd.com>
 <14eda829-fc6b-407a-93a3-0794ab521177@intel.com>
 <2dcbf93d-030d-466c-8b1c-8387513e9eb9@amd.com>

On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu, Tim,
> 
> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> > Hello Chenyu,
> > 
> > On 4/8/2026 5:05 PM, Chen, Yu C wrote:
> > > We haven't tried breaking it down further. One possible approach
> > > is to partition it at L2 scope, the benefit of which may depend on
> > > the workload.
> > 
> > I fear at that point we'll have too many cachelines and too much
> > cache pollution when the CPU starts reading this at tick to schedule
> > a newidle balance.
> > 
> > A 128 core system would bring in 128 * 64B = 8kB worth of data to
> > traverse the mask, and at that point it becomes a trade-off between
> > how fast you want reads vs writes, and does it even speed up writes
> > after a certain point?
> > 
> > Sorry I got distracted by some other stuff today but I'll share the
> > results from my experiments tomorrow.
> 
> Here is some data from an experiment I ran on a 3rd Generation EPYC
> system (2 socket x 64C/128T (8 LLCs per socket)):
> 
> Experiment: Two threads pinned per-CPU on all CPUs, yielding to each other
> and operating on some cpumask - one setting the current CPU on the
> mask and the other clearing the current CPU. As an estimate of the worst-case
> scenario, we have to do one modification per sched-switch.
> 
> I'm measuring total cycles taken for cpumask operations with the following
> variants:
> 
>                                                   %cycles vs global mask operation
> 
>     global mask                                 : 100.0000% (var: 3.28%)
>     per-NUMA mask                               :  32.9209% (var: 7.77%)
>     per-LLC mask                                :   1.2977% (var: 4.85%)
>     per-LLC mask (u8 operation; no LOCK prefix) :   0.4930% (var: 0.83%)
> 
> The per-NUMA split is 3x faster, per-LLC on this 16-LLC machine is 77x faster,
> and since there is enough space in the cacheline we can use a u8 to set
> and clear the CPU atomically without a LOCK prefix and then do a >> 3 to
> get the CPU index from the set bit, which is 202x faster.
> 
> If we use the u8 operations, we can only read 8 CPUs per 8-byte load on a
> 64-bit system, but with a per-LLC mask we can scan all 16 CPUs on the LLC
> with one 8-byte read, and the per-NUMA one requires two 8-byte reads to
> scan the 128 CPUs per socket.
> 
> I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline) is
> a good tradeoff between the speedup vs the amount of loads required to
> piece together the full cpumask. Thoughts?

I agree that a per-LLC mask is a good compromise between minimizing loads
and offering good speedups.

For Intel, I think we should get the LLC APICID mask from the 0x4 leaf
(L1, L2, L3) instead of inferring it from the 0x1f leaf (Tile, Die, ...etc).
For AMD, I think the cache leaf is 0x8000_001D. Those are parsed in the
cacheinfo code and we can get it from there.

Tim
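
For illustration, here is a minimal user-space C sketch of the u8-per-CPU
variant discussed above: one byte per CPU packed into a single cacheline,
plain byte stores with no LOCK prefix for set/clear, and a reader that scans
eight CPUs per 8-byte load, recovering the CPU index from the set bit with
a >> 3. The names CPUS_PER_LLC, llc_mask, llc_set_cpu and llc_first_cpu are
invented for this sketch, which assumes a 16-CPU LLC on a little-endian
machine; it is not code from the patch series.

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define CPUS_PER_LLC 16	/* assumption: 16 CPUs share this LLC */

	/* one byte per CPU, kept inside a single 64-byte cacheline */
	static uint8_t llc_mask[CPUS_PER_LLC] __attribute__((aligned(64)));

	/* aligned byte stores are naturally atomic on x86, no LOCK needed */
	static void llc_set_cpu(int cpu)   { llc_mask[cpu] = 1; }
	static void llc_clear_cpu(int cpu) { llc_mask[cpu] = 0; }

	/* scan 8 CPUs per 8-byte load; bit index >> 3 gives the byte (CPU) offset */
	static int llc_first_cpu(void)
	{
		for (int i = 0; i < CPUS_PER_LLC; i += 8) {
			uint64_t word;

			memcpy(&word, &llc_mask[i], sizeof(word));
			if (!word)
				continue;
			/* little-endian: lowest set bit lives in the lowest-addressed set byte */
			return i + (int)(__builtin_ctzll(word) >> 3);
		}
		return -1;
	}

	int main(void)
	{
		llc_set_cpu(5);
		llc_set_cpu(11);
		printf("first set CPU: %d\n", llc_first_cpu());	/* prints 5 */
		llc_clear_cpu(5);
		printf("first set CPU: %d\n", llc_first_cpu());	/* prints 11 */
		return 0;
	}

The point of the layout is that writers on different CPUs never need an
atomic read-modify-write (each owns its byte), while a reader still pieces
the LLC's occupancy together from at most two 8-byte loads of the same
cacheline.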