From: Christian Loehle
To: Andrea Righi, K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Valentin Schneider, Ricardo Neri, Shrikanth Hegde, Felix Abecassis,
 Joel Fernandes, Phil Auld, linux-kernel@vger.kernel.org, Julia Lawall
Subject: Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
Date: Fri, 3 Jul 2026 11:00:23 +0100
Message-ID: <2ed258a2-ac9f-413b-aa39-59a59cdee1fe@arm.com>
References: <20260630152747.128746-1-arighi@nvidia.com>

On 7/3/26 10:40, Andrea Righi wrote:
> Hi Prateek,
>
> On Fri, Jul 03, 2026 at 11:21:57AM +0530, K Prateek Nayak wrote:
>> Hello Andrea,
>>
>> On 6/30/2026 8:57 PM, Andrea Righi wrote:
>>> select_idle_capacity() scans all logical CPUs also when it is looking
>>> for a fully idle SMT core. Two concurrent wakeups can therefore observe
>>> the same core as idle, encounter different siblings first, and place one
>>> task on each sibling while another core remains unused.
>>>
>>> Make every logical CPU of a selected idle core resolve to the same
>>> stable CPU representative within the scan's existing affinity and
>>> scheduling-domain mask. If the first task is enqueued before the next
>>> scan examines the core, that scan rejects the now-busy core. If both
>>> scans observe the core as idle, they select the same runqueue even if
>>> the first enqueue becomes visible before the second scan finishes,
>>> exposing the imbalance to the load balancer.
>>>
>>> The symmetric CPU idle selection path is subject to the same race, but
>>> normally returns as soon as select_idle_core() finds a fully idle core,
>>> reducing the conflict window. The per-CPU capacity scan can retain an
>>> idle-core candidate while evaluating other CPUs, giving concurrent
>>> wakeups more opportunity to select different siblings of the same SMT
>>> core. Therefore, limit the normalization to the asym-capacity path,
>>> where this behavior has a measurable impact.
>>>
>>> On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
>>> CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
>>> core) showed a consistent 23% increase in mean throughput across
>>> multiple runs.
>>
>> Interesting! This reads like active balance across cores is not aggressive
>> enough for this workload and, as a result, stacking somehow helps.
>>
>> I would have expected balance within the core would trigger first and that
>> would just lead to the same scenario as both siblings busy but I
>> guess there is a higher order effect of stacking.
>
> I think the key here is that temporary runqueue stacking is preferable to
> consuming both SMT siblings when fully-idle SMT cores are available, more than
> having benefits from the stacking itself.
>

I'm curious now. As someone who is not at all an SMT expert, I find this
super counterintuitive; am I missing something?
How does this happen?
The SMT switch should have far less overhead than a task context switch, no?
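
For readers following the thread, the normalization described in the quoted
patch description can be sketched in userspace C roughly as below. This is
not the kernel code from the patch: cpuset64, first_cpu(), smt_siblings() and
normalize_idle_core_pick() are hypothetical names, the topology (siblings are
the CPU pairs 2k and 2k+1) is assumed purely for illustration, and choosing
the lowest-numbered allowed sibling is only one possible "stable
representative" rule; the patch may pick it differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit i set means CPU i is in the set. */
typedef uint64_t cpuset64;

/* Return the lowest-numbered CPU in the mask, or -1 if the mask is empty. */
static int first_cpu(cpuset64 mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Assumed topology: CPUs 2k and 2k+1 are SMT siblings of the same core. */
static cpuset64 smt_siblings(int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return UINT64_C(3) << (cpu & ~1);
}

/*
 * Given the sibling a scan happened to encounter first on an idle core,
 * return the same representative CPU for every sibling of that core:
 * here, the lowest-numbered sibling that is also in the allowed mask
 * (affinity intersected with the scheduling-domain span).
 * Returns -1 if no sibling of the core is allowed.
 */
static int normalize_idle_core_pick(int cpu, cpuset64 allowed)
{
	return first_cpu(smt_siblings(cpu) & allowed);
}
```

With all CPUs allowed, a wakeup that first encounters CPU 2 and one that first
encounters CPU 3 both resolve to CPU 2, so they target the same runqueue
instead of occupying both siblings. If CPU 2 is excluded from the allowed
mask, both resolve to CPU 3.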