Date: Wed, 3 Sep 2025 15:14:10 +0800
From: Aaron Lu
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel@vger.kernel.org,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chuyi Zhou,
	Jan Kiszka, Florian Bezdeka, Songtang Liu, Sebastian Andrzej Siewior
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Message-ID: <20250903071410.GA42@bytedance>
References: <20250715071658.267-1-ziqianlu@bytedance.com>
	<20250715071658.267-4-ziqianlu@bytedance.com>
	<20250808101330.GA75@bytedance>
	<20250812084828.GA52@bytedance>
	<20250815092910.GA33@bytedance>
	<20250822110701.GB289@bytedance>
In-Reply-To: <20250822110701.GB289@bytedance>

On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> On Fri, Aug 15, 2025 at 05:30:08PM +0800, Aaron Lu wrote:
> > On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
> ... ...
> > > I would also suggest running similar benchmarks but with deeper
> > > hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> > > when tg_unthrottle_up() goes up a bigger tree.
> >
> > No problem.
> >
> > I suppose I can reuse the previous shared test script:
> > https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
> >
> > There I used:
> > nr_level1=2
> > nr_level2=100
> > nr_level3=10
> >
> > But I can tweak these numbers for this performance evaluation. I can make
> > the leaf level to be 5 level deep and place tasks in leaf level cgroups
> > and configure quota on 1st level cgroups.
>
> Tested on Intel EMR(2 sockets, 120cores, 240cpus) and AMD Genoa(2
> sockets, 192cores, 384cpus), with turbo/boost disabled, cpufreq set to
> performance and cpuidle states all disabled.
>
> cgroup hierarchy:
> nr_level1=2
> nr_level2=2
> nr_level3=2
> nr_level4=5
> nr_level5=5
> i.e. two cgroups in the root level, with each level1 cgroup having 2
> child cgroups, and each level2 cgroup having 2 child cgroups, etc. This
> creates a 5 level deep, 200 leaf cgroups setup. Tasks are placed in leaf
> cgroups. Quota are set on the two level1 cgroups.
>
> The TLDR is, when there is a very large number of tasks(like 8000 tasks),
> task based throttle saw 10-20% performance drop on AMD Genoa; otherwise,
> no obvious performance change is observed. Detailed test results below.
>
> Netperf: measured in throughput, more is better
> - quota set to 50 cpu for each level1 cgroup;
> - each leaf cgroup run a pair of netserver and netperf with following
>   cmdline:
>   netserver -p $port_for_this_cgroup
>   netperf -p $port_for_this_cgroup -H 127.0.0.1 -t UDP_RR -c -C -l 30
>   i.e. each cgroup has 2 tasks, total task number is 2 * 200 = 400
>   tasks.
>
> On Intel EMR:
>               base            head            diff
> throughput    33305±8.40%     33995±7.84%     noise
>
> On AMD Genoa:
>               base            head            diff
> throughput    5013±1.16%      4967±1.82%      noise
>
>
> Hackbench, measured in seconds, less is better:
> - quota set to 50cpu for each level1 cgroup;
> - each cgroup runs with the following cmdline:
>   hackbench -p -g 1 -l $see_below
>   i.e. each leaf cgroup has 20 sender tasks and 20 receiver tasks, total
>   task number is 40 * 200 = 8000 tasks.
>
> On Intel EMR(loops set to 100000):
>
>         base            head            diff
> Time    85.45±3.98%     86.41±3.98%     noise
>
> On AMD Genoa(loops set to 20000):
>
>         base            head            diff
> Time    104±4.33%       116±7.71%       -11.54%
>
> So for this test case, task based throttle suffered ~10% performance
> drop. I also tested on another AMD Genoa(same cpu spec) to make sure
> it's not a machine problem and performance dropped there too:
>
> On 2nd AMD Genoa(loops set to 50000)
>
>         base            head            diff
> Time    81±3.13%        101±7.05%       -24.69%
>
> According to perf, __schedule() in head takes 7.29% cycles while in base
> it takes 4.61% cycles. I suppose with task based throttle, __schedule()
> is more frequent since tasks in a throttled cfs_rq have to be dequeued
> one by one while in current behaviour, the cfs_rq can be dequeued off rq
> in one go. This is most obvious when there are multiple tasks in a single
> cfs_rq; if there is only 1 task per cfs_rq, things should be roughly the
> same for the two throttling model.
>
> With this said, I reduced the task number and retested on this 2nd AMD
> Genoa:
> - quota set to 50 cpu for each level1 cgroup;
> - using only 1 fd pair, i.e. 2 task for each cgroup:
>   hackbench -p -g 1 -f 1 -l 50000000
>   i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
>   task number is 2 * 200 = 400 tasks.
>
>         base            head            diff
> Time    127.77±2.60%    127.49±2.63%    noise
>
> In this setup, performance is about the same.
>
> Now I'm wondering why on Intel EMR, running that extreme setup(8000
> tasks), performance of task based throttle didn't see noticeable drop...

Looks like hackbench doesn't like task migration on this AMD system
(domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).

If I revert patch5 and run this 40 * 200 = 8000 task hackbench workload
again, performance is roughly the same now (head~1 is slightly worse, but
given the 4+% stddev in base, it can be considered within the noise range):

        base            head~1(patch1-4)    diff     head(patch1-5)
Time    82.55±4.82%     84.45±2.70%         -2.3%    99.69±6.71%

According to /proc/schedstat, the lb_gained for domain2 is:

          NOT_IDLE    IDLE     NEWLY_IDLE
base      0           8052     81791
head~1    0           7197     175096
head      1           14818    793065

Other domains show similar numbers: base has the fewest migrations, head
has the most, and head~1 reduces the number a lot compared to head. I
suppose this is expected, because we removed the throttled_lb_pair()
restriction in patch5 and that can cause runnable tasks in a throttled
hierarchy to be balanced to other cpus, while in base this cannot happen.

I think patch5 still makes sense and is correct; it's just that this
specific workload doesn't like task migrations. Intel EMR doesn't suffer
from this, I suppose because EMR has a much larger LLC while AMD Genoa
has a relatively small LLC, and task migrations across the LLC boundary
hurt hackbench's performance.

I also tried applying the hack below to prove this "task migration across
LLC boundary hurts hackbench" theory on both base and head:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..34c5f6b75e53d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9297,6 +9297,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
+	if (!(env->sd->flags & SD_SHARE_LLC))
+		return 0;
+
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
 

With this diff applied, the result is:

        base'           head'           diff
Time    74.78±8.2%      78.87±15.4%     -5.47%

base': base + above diff
head': head + above diff

So both perform better now, but with much larger variance; I guess
that's because there is no load balancing on domain2 and above now.
head' is still worse than base', but not as much as before.
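
To put the domain2 lb_gained counts above in proportion, here is a small
Python sketch. It is not part of the series; the dictionary layout and the
helper names are made up for illustration, and summing the three idle types
into one total is my own simplification. The counts themselves are copied
from the table above.

```python
# lb_gained counts for domain2, copied from the /proc/schedstat table above.
# The structure and helper names are illustrative only.
LB_GAINED = {
    "base":   {"NOT_IDLE": 0, "IDLE": 8052,  "NEWLY_IDLE": 81791},
    "head~1": {"NOT_IDLE": 0, "IDLE": 7197,  "NEWLY_IDLE": 175096},
    "head":   {"NOT_IDLE": 1, "IDLE": 14818, "NEWLY_IDLE": 793065},
}


def total_gained(kernel):
    """Sum the tasks pulled by load balance over all idle types."""
    return sum(LB_GAINED[kernel].values())


def ratio_to_base(kernel):
    """Total tasks pulled on this kernel relative to base."""
    return total_gained(kernel) / total_gained("base")


if __name__ == "__main__":
    for kernel in LB_GAINED:
        print(f"{kernel:7s} total={total_gained(kernel):7d} "
              f"ratio={ratio_to_base(kernel):.2f}x")
```

With these numbers, head~1 pulls roughly 2x as many tasks as base on
domain2 and head roughly 9x, almost all of it in NEWLY_IDLE balancing.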

To conclude: hackbench doesn't like task migration, especially when a
task is migrated across the LLC boundary. patch5 removed the restriction
on balancing throttled tasks; this caused more balancing to happen and
hackbench doesn't like that. But balancing has its own merit and could
still benefit other workloads, so I think patch5 should stay, especially
considering that when throttled tasks are eventually dequeued, they will
not stay on the rq's cfs_tasks list, so there is no need to take special
care of them when doing load balance.

On a side note: should we increase the cost of balancing tasks across
the LLC boundary? I tried to enlarge sysctl_sched_migration_cost 100
times for domains without SD_SHARE_LLC in task_hot() but that didn't help.
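
For reference, the idea behind that last experiment can be modelled in
user space roughly as below. This is only a toy model in Python, not the
kernel change I tested: it keeps just the "ran recently enough to be cache
hot" comparison and ignores every other check in the real task_hot(). The
function names are hypothetical, and the 500000 ns base cost is an assumed
default for sysctl_sched_migration_cost.

```python
# Toy model only: the real task_hot() in kernel/sched/fair.c has more checks.
MIGRATION_COST_NS = 500_000  # assumed default of sysctl_sched_migration_cost
CROSS_LLC_SCALE = 100        # the 100x factor mentioned above


def effective_migration_cost(shares_llc,
                             base_cost_ns=MIGRATION_COST_NS,
                             scale=CROSS_LLC_SCALE):
    """Cost threshold for a domain; scaled up when the LLC is not shared."""
    return base_cost_ns if shares_llc else base_cost_ns * scale


def task_hot(ns_since_last_run, shares_llc):
    """A task counts as cache hot if it ran more recently than the threshold."""
    return ns_since_last_run < effective_migration_cost(shares_llc)


if __name__ == "__main__":
    # A task that last ran 1 ms ago: cold inside the LLC, hot across it.
    print(task_hot(1_000_000, shares_llc=True))
    print(task_hot(1_000_000, shares_llc=False))
```

In this model a task that last ran 1 ms ago is no longer hot within an LLC
domain but is still treated as hot for domains that span LLCs, which is
the bias the experiment was trying to introduce.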