From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 65949C3DA6F for ; Thu, 24 Aug 2023 18:48:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243092AbjHXSs0 (ORCPT ); Thu, 24 Aug 2023 14:48:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243102AbjHXSsO (ORCPT ); Thu, 24 Aug 2023 14:48:14 -0400 Received: from mail-oi1-f173.google.com (mail-oi1-f173.google.com [209.85.167.173]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2B3141BCD for ; Thu, 24 Aug 2023 11:48:11 -0700 (PDT) Received: by mail-oi1-f173.google.com with SMTP id 5614622812f47-3a7d7e5fb03so110174b6e.3 for ; Thu, 24 Aug 2023 11:48:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1692902890; x=1693507690; h=user-agent:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ZUvI6CU9c0j6C8Mgz85WkGeA6Ridh+7alzDfu1qS1xg=; b=anRALwsdCfrh/1Y8eJhn+9PBnQMyqLv5msprEEh6MeWGPfM1BiF6gzhDSRURiJPz08 rjHgRvnfeCIUqG6HccdjOOYdeIfkXxRwwkYZTEXPW5+8yLX+Kd0v+YzUS8qCXqAGmlrz P7bun/bsndM4747Qma4qN1HNVd/zTf0ILZukVj4NqK2A5ze2Rti0wtvdCoKbbskQLvrp 5j4ma4qKj8xH7lwWuDdJEeeBRf/5kwIcSp3gHaNE93iSutA3wOuUBTcn+6f6cX4erxYJ vp6TslEE81bhXcholoh8I9cQglr8EK8gZLOQVr40SjZ5eRYL7s3XDfKzl89IgBOrsgff F0DQ== X-Gm-Message-State: AOJu0YzVvqAi/9yy5fAL/CRzJiRdUkoUptxzo3ozCX0mi6B0plEfIMeI t7wjgG6ykazUU+KrRhhv4yM= X-Google-Smtp-Source: AGHT+IFaegcnn2+z30m4QPiXlGuidbBErshfFJx2gAYqVe3FJfDUAQP4FFE6v2raiXgUXIPgk6kJLQ== X-Received: by 2002:a05:6358:248b:b0:132:d42f:8e19 with SMTP id m11-20020a056358248b00b00132d42f8e19mr16307531rwc.31.1692902890298; Thu, 24 Aug 2023 11:48:10 -0700 (PDT) Received: from maniforge ([2620:10d:c091:400::5:1df8]) by smtp.gmail.com with ESMTPSA id q5-20020ac87345000000b00400a99b8b38sm18757qtp.78.2023.08.24.11.48.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 24 Aug 2023 11:48:09 -0700 (PDT) Date: Thu, 24 Aug 2023 13:48:07 -0500 From: David Vernet To: Aaron Lu Cc: Peter Zijlstra , Vincent Guittot , Ingo Molnar , Juri Lelli , Daniel Jordan , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Valentin Schneider , Tim Chen , Nitin Tekchandani , Yu Chen , Waiman Long , Deng Pan , Mathieu Desnoyers , "Gautham R . Shenoy" , linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Message-ID: <20230824184807.GA6119@maniforge> References: <20230823060832.454842-1-aaron.lu@intel.com> <20230823060832.454842-2-aaron.lu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20230823060832.454842-2-aaron.lu@intel.com> User-Agent: Mutt/2.2.10 (2023-03-25) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 23, 2023 at 02:08:32PM +0800, Aaron Lu wrote: > When using sysbench to benchmark Postgres in a single docker instance > with sysbench's nr_threads set to nr_cpu, it is observed there are times > update_cfs_group() and update_load_avg() shows noticeable overhead on > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > > 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > > Annotate shows the cycles are mostly spent on accessing tg->load_avg > with update_load_avg() being the write side and update_cfs_group() being > the read side. tg->load_avg is per task group and when different tasks > of the same taskgroup running on different CPUs frequently access > tg->load_avg, it can be heavily contended. > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > migration number is 11millions and with each migration, the task's load > will transfer from src cfs_rq to target cfs_rq and each change involves > an update to tg->load_avg. Since the workload can trigger as many wakeups > and migrations, the access(both read and write) to tg->load_avg can be > unbound. As a result, the two mentioned functions showed noticeable > overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > during a 5s window, wakeup number is 21millions and migration number is > 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > > Reduce the overhead by limiting updates to tg->load_avg to at most once > per ms. After this change, the cost of accessing tg->load_avg is greatly > reduced and performance improved. Detailed test results below. > > ============================== > postgres_sysbench on SPR: > 25% > base: 42382±19.8% > patch: 50174±9.5% (noise) > > 50% > base: 67626±1.3% > patch: 67365±3.1% (noise) > > 75% > base: 100216±1.2% > patch: 112470±0.1% +12.2% > > 100% > base: 93671±0.4% > patch: 113563±0.2% +21.2% > > ============================== > hackbench on ICL: > group=1 > base: 114912±5.2% > patch: 117857±2.5% (noise) > > group=4 > base: 359902±1.6% > patch: 361685±2.7% (noise) > > group=8 > base: 461070±0.8% > patch: 491713±0.3% +6.6% > > group=16 > base: 309032±5.0% > patch: 378337±1.3% +22.4% > > ============================= > hackbench on SPR: > group=1 > base: 100768±2.9% > patch: 103134±2.9% (noise) > > group=4 > base: 413830±12.5% > patch: 378660±16.6% (noise) > > group=8 > base: 436124±0.6% > patch: 490787±3.2% +12.5% > > group=16 > base: 457730±3.2% > patch: 680452±1.3% +48.8% > > ============================ > netperf/udp_rr on ICL > 25% > base: 114413±0.1% > patch: 115111±0.0% +0.6% > > 50% > base: 86803±0.5% > patch: 86611±0.0% (noise) > > 75% > base: 35959±5.3% > patch: 49801±0.6% +38.5% > > 100% > base: 61951±6.4% > patch: 70224±0.8% +13.4% > > =========================== > netperf/udp_rr on SPR > 25% > base: 104954±1.3% > patch: 107312±2.8% (noise) > > 50% > base: 55394±4.6% > patch: 54940±7.4% (noise) > > 75% > base: 13779±3.1% > patch: 36105±1.1% +162% > > 100% > base: 9703±3.7% > patch: 28011±0.2% +189% > > ============================================== > netperf/tcp_stream on ICL (all in noise range) > 25% > base: 43092±0.1% > patch: 42891±0.5% > > 50% > base: 19278±14.9% > patch: 22369±7.2% > > 75% > base: 16822±3.0% > patch: 17086±2.3% > > 100% > base: 18216±0.6% > patch: 18078±2.9% > > =============================================== > netperf/tcp_stream on SPR (all in noise range) > 25% > base: 34491±0.3% > patch: 34886±0.5% > > 50% > base: 19278±14.9% > patch: 22369±7.2% > > 75% > base: 16822±3.0% > patch: 17086±2.3% > > 100% > base: 18216±0.6% > patch: 18078±2.9% > > Reported-by: Nitin Tekchandani > Suggested-by: Vincent Guittot > Signed-off-by: Aaron Lu > Reviewed-by: Vincent Guittot Hey Aaron, Thanks for working on this. It LGTM modulo two small nits. Feel free to add my Reviewed-by if you'd like regardless: Reviewed-by: David Vernet > --- > kernel/sched/fair.c | 13 ++++++++++++- > kernel/sched/sched.h | 1 + > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c28206499a3d..a5462d1fcc48 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > */ > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > { > - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > + long delta; > + u64 now; > > /* > * No need to update load_avg for root_task_group as it is not used. > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > if (cfs_rq->tg == &root_task_group) > return; > > + /* > + * For migration heavy workload, access to tg->load_avg can be s/workload/workloads > + * unbound. Limit the update rate to at most once per ms. Can we describe either here or in the commit summary how we arrived at 1ms? I'm fine with hard-coded heuristics like this (just like the proposed 6-core shard size in the shared_runq patchset), but it would also be ideal to give a bit more color on how we arrived here, because we'll forget immediately otherwise. > + */ > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > + return; > + > + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > atomic_long_add(delta, &cfs_rq->tg->load_avg); > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > + cfs_rq->last_update_tg_load_avg = now; > } > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 6a8b7b9ed089..52ee7027def9 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -593,6 +593,7 @@ struct cfs_rq { > } removed; > > #ifdef CONFIG_FAIR_GROUP_SCHED > + u64 last_update_tg_load_avg; > unsigned long tg_load_avg_contrib; > long propagate; > long prop_runnable_sum; > -- > 2.41.0 >