From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
From: Tim Chen
To: K Prateek Nayak, "Chen, Yu C", Peter Zijlstra
Cc: Pan Deng, mingo@kernel.org, linux-kernel@vger.kernel.org, tianyou.li@intel.com
Date: Thu, 09 Apr 2026 16:09:55 -0700
References: <20260320124003.GU3738786@noisy.programming.kicks-ass.net>
 <63a095f02428700a7ff2623b8ea81e524a406834.camel@linux.intel.com>
 <20260324120008.GB3738010@noisy.programming.kicks-ass.net>
 <138c3f9d-309f-41e6-aa72-a3f6bd713bf0@intel.com>
 <22072ef8-5aec-49ac-9cc4-8a80bec14261@amd.com>
 <64649c85-29ab-4f70-a0c4-3c83cbdae2fc@intel.com>
 <20260402105530.GA3738786@noisy.programming.kicks-ass.net>
 <93d7eb33-c3a5-4498-bc26-57806b73d9e0@amd.com>
 <3b66e8e8-07e0-4f3e-a3ba-d97133af5162@intel.com>
 <1c742a1d8ecd8e314d704d46a44e2b8893479e50.camel@linux.intel.com>
 <2881a07f-ff14-4faa-9da7-3fbe206a463d@amd.com>
 <14eda829-fc6b-407a-93a3-0794ab521177@intel.com>
 <2dcbf93d-030d-466c-8b1c-8387513e9eb9@amd.com>

On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu, Tim,
> 
> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> > Hello Chenyu,
> > 
> > On 4/8/2026 5:05 PM, Chen, Yu C wrote:
> > > We haven't tried breaking it down further. One possible approach
> > > is to partition it at L2 scope, the benefit of which may depend on
> > > the workload.
> > 
> > I fear at that point we'll have too many cachelines and too much
> > cache pollution when the CPU starts reading this at tick to schedule
> > a newidle balance.
> > 
> > A 128 core system would bring in 128 * 64B = 8kB worth of data to
> > traverse the mask, and at that point it becomes a trade-off between
> > how fast you want reads vs writes, and does it even speed up writes
> > after a certain point?
> > 
> > Sorry I got distracted by some other stuff today but I'll share the
> > results from my experiments tomorrow.
> 
> Here is some data from an experiment I ran on a 3rd Generation EPYC
> system (2 socket x 64C/128T (8 LLCs per socket)):
> 
> Experiment: Two threads pinned per-CPU on all CPUs, yielding to each other
> and operating on some cpumask - one setting the current CPU on the
> mask and the other clearing the current CPU. As an estimate of the worst-case
> scenario, we have to do one modification per sched-switch.
> 
> I'm measuring total cycles taken for cpumask operations with the following
> variants:
> 
>                                                   %cycles vs global mask operation
> 
>     global mask                                 : 100.0000% (var: 3.28%)
>     per-NUMA mask                               :  32.9209% (var: 7.77%)
>     per-LLC mask                                :   1.2977% (var: 4.85%)
>     per-LLC mask (u8 operation; no LOCK prefix) :   0.4930% (var: 0.83%)
> 
> The per-NUMA split is 3x faster, per-LLC on this 16-LLC machine is 77x faster,
> and since there is enough space in the cacheline we can use a u8 to set
> and clear the CPU atomically without a LOCK prefix and then do a >> 3 to
> get the CPU index from the set bit, which is 202x faster.
> 
> If we use the u8 operations, we can only read 8 CPUs per 8-byte load on a
> 64-bit system, but with a per-LLC mask we can scan all 16 CPUs on the LLC
> with one 8-byte read, and the per-NUMA one requires two 8-byte reads to
> scan the 128 CPUs per socket.
> 
> I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline) is
> a good tradeoff between the speedup vs the amount of loads required to
> piece together the full cpumask. Thoughts?

I agree that a per-LLC mask is a good compromise between minimizing loads
and offering good speedups.

For Intel, I think we should get the LLC APICID mask from the 0x4 leaf
(L1, L2, L3) instead of inferring it from the 0x1f leaf (Tile, Die, ...etc).
For AMD, I think the cache leaf is 0x8000_001D. Those are parsed in the
cacheinfo code and we can get it from there.

Tim
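
For illustration, here is a minimal user-space C sketch of the u8-per-CPU
variant discussed above: one byte per CPU packed into a single cacheline,
plain byte stores with no LOCK prefix for set/clear, and a reader that scans
eight CPUs per 8-byte load, recovering the CPU index from the set bit with
a >> 3. The names CPUS_PER_LLC, llc_mask, llc_set_cpu and llc_first_cpu are
invented for this sketch, which assumes a 16-CPU LLC on a little-endian
machine; it is not code from the patch series.

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define CPUS_PER_LLC 16	/* assumption: 16 CPUs share this LLC */

	/* one byte per CPU, kept inside a single 64-byte cacheline */
	static uint8_t llc_mask[CPUS_PER_LLC] __attribute__((aligned(64)));

	/* aligned byte stores are naturally atomic on x86, no LOCK needed */
	static void llc_set_cpu(int cpu)   { llc_mask[cpu] = 1; }
	static void llc_clear_cpu(int cpu) { llc_mask[cpu] = 0; }

	/* scan 8 CPUs per 8-byte load; bit index >> 3 gives the byte (CPU) offset */
	static int llc_first_cpu(void)
	{
		for (int i = 0; i < CPUS_PER_LLC; i += 8) {
			uint64_t word;

			memcpy(&word, &llc_mask[i], sizeof(word));
			if (!word)
				continue;
			/* little-endian: lowest set bit lives in the lowest-addressed set byte */
			return i + (int)(__builtin_ctzll(word) >> 3);
		}
		return -1;
	}

	int main(void)
	{
		llc_set_cpu(5);
		llc_set_cpu(11);
		printf("first set CPU: %d\n", llc_first_cpu());	/* prints 5 */
		llc_clear_cpu(5);
		printf("first set CPU: %d\n", llc_first_cpu());	/* prints 11 */
		return 0;
	}

The point of the layout is that writers on different CPUs never need an
atomic read-modify-write (each owns its byte), while a reader still pieces
the LLC's occupancy together from at most two 8-byte loads of the same
cacheline.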