Date: Fri, 18 Oct 2024 17:05:43 +0530
From: Vishal Chourasia
To: Steve Wahl
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel@vger.kernel.org,
	Russ Anderson, Dimitri Sivanich
Subject: Re: [PATCH] sched/topology: improve topology_span_sane speed
References: <20241010155111.230674-1-steve.wahl@hpe.com>
In-Reply-To: <20241010155111.230674-1-steve.wahl@hpe.com>

On Thu, Oct 10, 2024 at 10:51:11AM -0500, Steve Wahl wrote:
> Use a different approach to topology_span_sane(), that checks for the
> same constraint of no partial overlaps for any two CPU sets for
> non-NUMA topology levels, but does so in a way that is O(N) rather
> than O(N^2).
>
> Instead of comparing with all other masks to detect collisions, keep
> one mask that includes all CPUs seen so far and detect collisions with
> a single cpumask_intersects test.
>
> If the current mask has no collisions with previously seen masks, it
> should be a new mask, which can be uniquely identified by the lowest
> bit set in this mask. Keep a pointer to this mask for future
> reference (in an array indexed by the lowest bit set), and add the
> CPUs in this mask to the list of those seen.
>
> If the current mask does collide with previously seen masks, it should
> be exactly equal to a mask seen before, looked up in the same array
> indexed by the lowest bit set in the mask, a single comparison.
>
> Move the topology_span_sane() check out of the existing topology level
> loop, let it use its own loop so that the array allocation can be done
> only once, shared across levels.
>
> On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> the average time to take one processor offline is reduced from 2.18
> seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
> 34m49.765s without this change, 16m10.038s with this change in place.)
>
> Signed-off-by: Steve Wahl
> ---
>  kernel/sched/topology.c | 79 ++++++++++++++++++++++++++++-------------
>  1 file changed, 54 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9748a4c8d668..c150074033d3 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2356,36 +2356,65 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>
>  /*
>   * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
> - * any two given CPUs at this (non-NUMA) topology level.
> + * any two given CPUs on non-NUMA topology levels.
>   */
> -static bool topology_span_sane(struct sched_domain_topology_level *tl,
> -			       const struct cpumask *cpu_map, int cpu)
> +static bool topology_span_sane(const struct cpumask *cpu_map)
>  {
> -	int i = cpu + 1;
> +	struct sched_domain_topology_level *tl;
> +	const struct cpumask **masks;
> +	struct cpumask *covered;
> +	int cpu, id;
> +	bool ret = false;
>
> -	/* NUMA levels are allowed to overlap */
> -	if (tl->flags & SDTL_OVERLAP)
> -		return true;
> +	lockdep_assert_held(&sched_domains_mutex);
> +	covered = sched_domains_tmpmask;
> +
> +	masks = kmalloc_array(num_possible_cpus(), sizeof(struct cpumask *), GFP_KERNEL);
> +	if (!masks)
> +		return ret;
> +
> +	for_each_sd_topology(tl) {
> +
> +		/* NUMA levels are allowed to overlap */
> +		if (tl->flags & SDTL_OVERLAP)
> +			continue;
> +
> +		cpumask_clear(covered);
> +		memset(masks, 0, num_possible_cpus() * sizeof(struct cpumask *));
>
> -	/*
> -	 * Non-NUMA levels cannot partially overlap - they must be either
> -	 * completely equal or completely disjoint. Otherwise we can end up
> -	 * breaking the sched_group lists - i.e. a later get_group() pass
> -	 * breaks the linking done for an earlier span.
> -	 */
> -	for_each_cpu_from(i, cpu_map) {
>  		/*
> -		 * We should 'and' all those masks with 'cpu_map' to exactly
> -		 * match the topology we're about to build, but that can only
> -		 * remove CPUs, which only lessens our ability to detect
> -		 * overlaps
> +		 * Non-NUMA levels cannot partially overlap - they must be either
> +		 * completely equal or completely disjoint. Otherwise we can end up
> +		 * breaking the sched_group lists - i.e. a later get_group() pass
> +		 * breaks the linking done for an earlier span.
>  		 */
> -		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> -			return false;
> +		for_each_cpu(cpu, cpu_map) {
> +			/* lowest bit set in this mask is used as a unique id */
> +			id = cpumask_first(tl->mask(cpu));
> +
> +			/* if this mask doesn't collide with what we've already seen */
> +			if (!cpumask_intersects(tl->mask(cpu), covered)) {
> +				/* this failing would be an error in this algorithm */
> +				if (WARN_ON(masks[id]))
> +					goto notsane;
> +
> +				/* record the mask we saw for this id */
> +				masks[id] = tl->mask(cpu);
> +				cpumask_or(covered, tl->mask(cpu), covered);
> +			} else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
> +				/*
> +				 * a collision with covered should have exactly matched
> +				 * a previously seen mask with the same id
> +				 */
> +				goto notsane;
> +			}
> +		}
>  	}
> +	ret = true;
>
> -	return true;
> +notsane:
> +	kfree(masks);
> +	return ret;
>  }
>
>  /*
> @@ -2417,9 +2446,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		sd = NULL;
>  		for_each_sd_topology(tl) {
>
> -			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> -				goto error;
> -
>  			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>
>  			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
> @@ -2433,6 +2459,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>
> +	if (WARN_ON(!topology_span_sane(cpu_map)))
> +		goto error;

Hi Steve,

Is there any reason why the above check is done after initializing the
sched domain structs for all the CPUs in cpu_map? It looks to me that
this check could be performed before the call to
__visit_domain_allocation_hell() in build_sched_domains(), giving an
early return if topology_span_sane() detects an incorrect topology.

Also, the error path in the current code only cleans up the d->rd
struct, keeping all the work done by build_sched_domain() inside the
loop and by __sdt_alloc() called from __visit_domain_allocation_hell().
Is that because we need all that work to remain intact?

static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
				 const struct cpumask *cpu_map)
{
	switch (what) {
	case sa_rootdomain:
		if (!atomic_read(&d->rd->refcount))
			free_rootdomain(&d->rd->rcu);
		fallthrough;
	case sa_sd:
		free_percpu(d->sd);
		fallthrough;
	case sa_sd_storage:
		__sdt_free(cpu_map);
		fallthrough;
	case sa_none:
		break;
	}
}

> +
>  	/* Build the groups for the domains */
>  	for_each_cpu(i, cpu_map) {
>  		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> --
> 2.26.2
>
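
For reference, the reordering I have in mind would look roughly like
this (untested sketch of the build_sched_domains() prologue with the
check hoisted; unrelated declarations and the rest of the function
elided):

static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
	enum s_alloc alloc_state = sa_none;
	struct s_data d;
	int ret = -ENOMEM;

	if (WARN_ON(cpumask_empty(cpu_map)))
		goto error;

	/* Fail fast, before any per-CPU sched domain memory is allocated. */
	if (WARN_ON(!topology_span_sane(cpu_map)))
		goto error;

	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
	if (alloc_state != sa_rootdomain)
		goto error;

	/* ... rest of the function unchanged ... */

error:
	__free_domain_allocs(&d, alloc_state, cpu_map);
	return ret;
}

With alloc_state still sa_none at that point, the error path's
__free_domain_allocs() hits the sa_none case and does nothing, so a
bad topology would be rejected before there is anything to unwind.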