Date: Wed, 3 Sep 2025 18:11:02 +0800
From: Aaron Lu
To: K Prateek Nayak
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
    Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
    linux-kernel@vger.kernel.org, Juri Lelli, Dietmar Eggemann,
    Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka,
    Songtang Liu, Sebastian Andrzej Siewior
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Message-ID: <20250903101102.GB42@bytedance>
References: <20250715071658.267-4-ziqianlu@bytedance.com>
    <20250808101330.GA75@bytedance> <20250812084828.GA52@bytedance>
    <20250815092910.GA33@bytedance> <20250822110701.GB289@bytedance>
    <20250903071410.GA42@bytedance>
    <2b2e91e7-0514-4728-a3a3-96282a0d4286@amd.com>
In-Reply-To: <2b2e91e7-0514-4728-a3a3-96282a0d4286@amd.com>

Hi Prateek,

On Wed, Sep 03, 2025 at 02:41:55PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/3/2025 12:44 PM, Aaron Lu wrote:
> > On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> >> With this said, I reduced the task number and retested on this 2nd AMD
> >> Genoa:
> >> - quota set to 50 cpu for each level1 cgroup;
>
> What exactly is the quota and period when you say 50cpu?

period is the default 100000 and quota is set to 5000000.

>
> >> - using only 1 fd pair, i.e. 2 task for each cgroup:
> >>   hackbench -p -g 1 -f 1 -l 50000000
> >>   i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
> >>   task number is 2 * 200 = 400 tasks.
> >>
> >>         base             head             diff
> >> Time    127.77±2.60%     127.49±2.63%     noise
> >>
> >> In this setup, performance is about the same.
> >>
> >> Now I'm wondering why on Intel EMR, running that extreme setup(8000
> >> tasks), performance of task based throttle didn't see noticeable drop...
> >
> > Looks like hackbench doesn't like task migration on this AMD system
> > (domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
> >
> > If I revert patch5, running this 40 * 200 = 8000 hackbench workload
> > again, performance is roughly the same now(head~1 is slightly worse but
> > given the 4+% stddev in base, it can be considered in noise range):
> >
> >         base            head~1(patch1-4)    diff     head(patch1-5)
> > Time    82.55±4.82%     84.45±2.70%         -2.3%    99.69±6.71%
> >
> > According to /proc/schedstat, the lb_gained for domain2 is:
> >
> >           NOT_IDLE    IDLE     NEWLY_IDLE
> > base      0           8052     81791
> > head~1    0           7197     175096
> > head      1           14818    793065
>
> Since these are mostly idle and newidle balance, I wonder if we can run
> into a scenario where,
>
> 1. All the tasks are throttled.
> 2. CPU turning idle does a newidle balance.
> 3. CPU pulls a tasks from throttled hierarchy and selects it.
> 4. The task exits to user space and is dequeued.
> 5. Goto 1.
>
> and when the CPU is unthrottled, it has a large number of tasks on it
> that'll again require a load balance to even stuff out.
>

I think the balance number is much larger because we now allow balancing
tasks under a throttled hierarchy.

> >
> > Other domains have similar number: base has smallest migration number
> > while head has the most and head~1 reduce the number a lot. I suppose
> > this is expected, because we removed the throttled_lb_pair() restriction
> > in patch5 and that can cause runnable tasks in throttled hierarchy to be
> > balanced to other cpus while in base, this can not happen.
> >
> > I think patch5 still makes sense and is correct, it's just this specific
> > workload doesn't like task migrations. Intel EMR doesn't suffer from
> > this, I suppose that's because EMR has a much larger LLC while AMD Genoa
> > has a relatively small LLC and task migrations across LLC boundary hurts
> > hackbench's performance.
>
> I think we can leave the throttled_lb_pair() condition as is and revisit
> it later if this is visible in real world workloads. I cannot think of
> any easy way to avoid the case for potential pileup without accounting
> for the throttled tasks in limbo except for something like below at
> head~1:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bdc9bfa0b9ef..3dc807af21ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9385,7 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	/*
>  	 * We do not migrate tasks that are:
>  	 * 1) delayed dequeued unless we migrate load, or
> -	 * 2) throttled_lb_pair, or
> +	 * 2) throttled_lb_pair unless we migrate load, or
>  	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
>  	 * 4) running (obviously), or
>  	 * 5) are cache-hot on their current CPU, or
> @@ -9394,7 +9394,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
>  		return 0;
>
> -	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> +	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu) &&
> +	    env->migration_type != migrate_load)
>  		return 0;
>
>  	/*
> ---
>
> Since load_avg moves slowly, it might be enough to avoid pileup of
> tasks. This is similar to the condition for migrating delayed tasks
> above but unlike the hierarchies of delayed tasks, the weight of
> throttled hierarchy does change when throttled tasks are transitioned to
> limbo so this needs some more staring at.
>

I was thinking: should we disallow balancing a task to a throttled
target cfs_rq? In the task based throttle model, a task that is on the
rq's cfs_tasks list is allowed to run, so we should not check the src
cfs_rq's throttle status. We should still check whether the target
cfs_rq is throttled, though: if it is, the balance is probably not very
useful.

I tried the diff below and the performance is restored:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e1..3e927b9b7eeb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9370,6 +9370,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
+	if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
+		return 0;
+
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow

        base            head'           diff     head(patch1-5)
Time    82.55±4.82%     83.81±2.89%     -1.5%    99.69±6.71%

head': head + above diff

I also tested netperf on this AMD system, as well as hackbench and
netperf on Intel EMR. No obvious performance difference was observed
after applying the above diff, i.e. base and head' performance is
roughly the same.

Does the above diff make sense? One thing I'm slightly concerned about
is that there may be one case where balancing a task to a throttled
target cfs_rq makes sense: if the task holds some kernel resource and is
running inside the kernel, then even if its target cfs_rq is throttled,
we still want this task to go there and finish its job in kernel mode
sooner, which could help other resource waiters. But this may not be a
big deal, and most of the time balancing a task to a throttled cfs_rq
doesn't look like a meaningful thing to do.

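For reference, the three behaviours compared in this thread can be
written down as a toy userspace model. This is only a sketch, not kernel
code: the names toy_lb_case, toy_policy and toy_can_migrate are made up
for illustration, and the real can_migrate_task() checks many more
conditions. It assumes that base refuses a migration when the task's
hierarchy is throttled on either the source or the destination CPU
(the throttled_lb_pair() check), that head has no throttle check, and
that head' refuses only when the destination is throttled.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the state the check looks at. */
struct toy_lb_case {
	bool src_throttled;	/* task's hierarchy throttled on the source CPU */
	bool dst_throttled;	/* task's hierarchy throttled on the destination CPU */
};

/* The three variants discussed in the thread (names are illustrative). */
enum toy_policy {
	TOY_BASE,	/* throttled_lb_pair(): refuse if either side is throttled */
	TOY_HEAD,	/* patch 5 applied: no throttle check at all */
	TOY_HEAD_PRIME,	/* head + proposed diff: refuse only a throttled destination */
};

/* Return true if the throttle-related check alone would allow migration. */
static bool toy_can_migrate(enum toy_policy policy, struct toy_lb_case c)
{
	switch (policy) {
	case TOY_BASE:
		return !(c.src_throttled || c.dst_throttled);
	case TOY_HEAD:
		return true;
	case TOY_HEAD_PRIME:
		return !c.dst_throttled;
	}
	return false;
}
```

In this model head' differs from head only when the destination is
throttled, and differs from base only when the source is throttled but
the destination is not.
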
Best regards, Aaron