From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pg1-f179.google.com (mail-pg1-f179.google.com [209.85.215.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 845422AE68 for ; Thu, 4 Sep 2025 08:16:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756973785; cv=none; b=q5O3+B+3zWh2NQstLPSZdL7ov3Muk3ek0MF28yUr89Ha2lLVygvwG1PExrmoy+R+XADvARZne2ohHFD8dUZJVxYlj/9zOewlx1dTxf6jSBU1sTrW+9KyPJI+UDSVSGQnR/Ffk/TOX+D3ULfwK5r/yZOyRGSVpWCcim2PYa8yQQo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756973785; c=relaxed/simple; bh=lEAo/i+3UCtQzxnQIa3OX2Jw6ExCPfTRbgK3vT4vdPQ=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=h+pi6wxPitmUwvLPHfwpbyAcR3CnN/QvxXeTK9NeeDCqWXAh4fOVbKKiL4vx499aV4mBnQuiSZCKmrb+jUE7Ir+8fyCCiNf+1UiLFYp10Hyqilp6pnA60t8uTzzVMFZXjs1qXAdWq8cXw8IUnlH+LYVXrBdINJbAkVupFEmxT2E= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=cg0TBEAe; arc=none smtp.client-ip=209.85.215.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="cg0TBEAe" Received: by mail-pg1-f179.google.com with SMTP id 41be03b00d2f7-b4f9d61e7deso461231a12.2 for ; Thu, 04 Sep 2025 01:16:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1756973783; x=1757578583; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=2ADLLJPjcyHqMqFotpE8rqFKN/fEUykCZgiIgkTBpDE=; b=cg0TBEAeOWhVFEUURnCTKfzdpOWvniVT9JwxfLzP1LebAddY3lsxN60xJ8Xa3SyiRk q9T/idpgLORo62CXQNFdmVEvANgcQgKFdTYCmBEMQILASdLcLSXepapB9RvFT/Osd5Dy zyFN9d4dHMBtrkeswyH0ZD5c5YO/KrddLjFk3fYhKIv+YhHjAx5iQfjLoowutka6JAjS DVysAcyilAvFWomsMquyXH+oD+DjSseNJ8+yJd161gyDhTZShFdRXLP0K9bWQTlH+FuJ UyRcC4NxihSvCqptUcoX+1NAm2n+fpRG1PXQmQAOoAb3sl1SUBKm/xm2ZjZcjCrwh30v 3yxQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1756973783; x=1757578583; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=2ADLLJPjcyHqMqFotpE8rqFKN/fEUykCZgiIgkTBpDE=; b=Tv12sF6kuk8pDUE1q+/HeAXASLh2AisTFj5ScISfsm1fDmoQWV8nzID/9jSUXv+L1k /sxwDx24thuusDzUyyJzqTOcNSuJZLhO9q5Un7OjwvTqBsdatbLGDnce+lj20t2HEMzX rnTgPAhEfzrKKEBpEBiGn/tajYS4vrBuJGXmylV6gDwS8rwq2KIYcD+V475uBkBJ53i+ JL/Exy44Y+hId8w8t8OtcTgSG2CHrpMElS68s053nnEEdYFc+2mZIY+t9T1MSstKpdvK 3uOY2bFy2jxl3DMSUaHtGzwUOUPhzKzUj7o7hiV2OoDe2dP0Kk7fI3TUen8+w6aFzpBO HFTw== X-Forwarded-Encrypted: i=1; AJvYcCUT1EMu5ukFp6X8nzs0yDHn1C2Rpcb5/XYX2Fz3GF3yJMTsbchYmLVZhNTjFVFkbxSh409soz9x/ie6Wlo=@vger.kernel.org X-Gm-Message-State: AOJu0Yy6n2Bjyow5BFc3bo46zywEfga0sf5k5uteOC7Y80BfqPzwM0Yc ZVNWDloPEkyaZ8G/lQrKkRolQKseS8AADq2dosdZItuhiTsIoAubKIlV7oA8NcoKPA== X-Gm-Gg: ASbGncviqebmdGJxk7R3E6uSReiojCzFFSrUBzAMOCzdXAyVP1Sza/3eADFR6C+mpPW tmSvjKyMQGvoT6JhWdXBs/SdxGbp3KiepnY5SKuQOOKR1dhK6XlfS8tQVm6w+wZ6HmcMaMEAsSQ YtnT7VUxwyqHQvaV/G04twoiC3DpnClK7+MXUM7Mek4nN6Su2Aq/vXbD8kyrC+v9E8332rzkVya hm7we899DJ0/0g4QoCGeD7/RWkXako5VjRwZfUCNmpfj8OzVKGJFPduacO8YrpIVK1qgRDt8Zl+ ugo0hOdZH8J40L9blGb+lgp7cuTHJFxabW4mV9oFJAJzyvxPPrA3/cedG5HSlcqrjmoDWBPiiZW QTfpjzfdjv1WGFOPyx1Z7SgNcVxCJhEHwKqB256yw9YeRULHlmw== X-Google-Smtp-Source: AGHT+IG31mZwNCOcdnsyBIHLSET1LHfkzf8HUmPQ3hOewg/SGDXYr3rw24rtaKmNjAbwfDspCWCIKg== X-Received: by 2002:a17:902:ecc1:b0:24c:b54e:bf0a with SMTP id d9443c01a7336-24cb54ec1c4mr40210455ad.0.1756973782487; Thu, 04 Sep 2025 01:16:22 -0700 (PDT) Received: from bytedance ([61.213.176.55]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-24903702396sm183459885ad.14.2025.09.04.01.16.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Sep 2025 01:16:21 -0700 (PDT) Date: Thu, 4 Sep 2025 16:16:11 +0800 From: Aaron Lu To: Benjamin Segall Cc: K Prateek Nayak , Peter Zijlstra , Valentin Schneider , Chengming Zhou , Josh Don , Ingo Molnar , Vincent Guittot , Xi Wang , linux-kernel@vger.kernel.org, Juri Lelli , Dietmar Eggemann , Steven Rostedt , Mel Gorman , Chuyi Zhou , Jan Kiszka , Florian Bezdeka , Songtang Liu , Chen Yu , Matteo Martelli , Michal Koutn?? , Sebastian Andrzej Siewior Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model Message-ID: <20250904081611.GE42@bytedance> References: <20250829081120.806-1-ziqianlu@bytedance.com> <20250829081120.806-4-ziqianlu@bytedance.com> <20250903145124.GM4067720@noisy.programming.kicks-ass.net> <14be66aa-e088-4267-ac10-d04d600b1294@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote: > K Prateek Nayak writes: > > > Hello Peter, > > > > On 9/3/2025 8:21 PM, Peter Zijlstra wrote: > >>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > >>> { > >>> + if (task_is_throttled(p)) { > >>> + dequeue_throttled_task(p, flags); > >>> + return true; > >>> + } > >>> + > >>> if (!p->se.sched_delayed) > >>> util_est_dequeue(&rq->cfs, p); > >>> > >> > >> OK, so this makes it so that either a task is fully enqueued (all > >> cfs_rq's) or full not. A group cfs_rq is only marked throttled when all > >> its tasks are gone, and unthrottled when a task gets added. Right? > > > > cfs_rq (and the hierarchy below) is marked throttled when the quota > > has elapsed. Tasks on the throttled hierarchies will dequeue > > themselves completely via task work added during pick. When the last > > task leaves on a cfs_rq of throttled hierarchy, PELT is frozen for > > that cfs_rq. > > > > When a new task is added on the hierarchy, the PELT is unfrozen and > > the task becomes runnable. The cfs_rq and the hierarchy is still > > marked throttled. > > > > Unthrottling of hierarchy is only done at distribution. > > > >> > >> But propagate_entity_cfs_rq() is still doing the old thing, and has a > >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity() > >> iteration. > >> > >> This seems somewhat inconsistent; or am I missing something ? > > > > Probably an oversight. But before that, what was the reason to have > > stopped this propagation at throttled_cfs_rq() before the changes? > > > > Yeah, this was one of the things I was (slowly) looking at - with this > series we currently still abort in: > > 1) update_cfs_group > 2) dequeue_entities's set_next_buddy > 3) check_preempt_fair > 4) yield_to > 5) propagate_entity_cfs_rq > > In the old design on throttle immediately remove the entire cfs_rq, > freeze time for it, and stop adjusting load. In the new design we still > pick from it, so we definitely don't want to stop time (and don't). I'm > guessing we probably also want to now adjust load for it, but it is > arguable - since all the cfs_rqs for the tg are likely to throttle at the > same time, so we might not want to mess with the shares distribution, > since when unthrottle comes around the most likely correct distribution > is the distribution we had at the time of throttle. > I can give it a test to see how things change by adjusting load and share distribution using my previous performance tests. > Assuming we do want to adjust load for a throttle then we probably want > to remove the aborts from update_cfs_group and propagate_entity_cfs_rq. > I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but > I'm not 100% sure when they are actually doing something in propagate as > opposed to enqueue. > Yes, commit 0258bdfaff5bd("sched/fair: Fix unfairness caused by missing load decay") added that list_add_leaf_cfs_rq() in propagate_entity_cfs_rq() to fix a problem. > The other 3 are the same sort of thing - scheduling pick heuristics > which imo are pretty arbitrary to keep. We can reasonably say that "the > most likely thing a task in a throttled hierarchy will do is just go > throttle itself, so we shouldn't buddy it or let it preempt", but it > would also be reasonable to let them preempt/buddy normally, in case > they hold locks or such. I think we do not need to special case tasks in throttled hierarchy in check_preempt_wakeup_fair(). > > yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a > throttle-on-exit kvm cpu thread isn't useful (so no need to remove the > abort there). The dma code is just yielding to a just-spawned kthread, > so it should be fine either way. Get it. The cumulated diff I'm going to experiment is below, let me know if something is wrong, thanks. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3e927b9b7eeb6..c2e46b8e5e3d9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se) if (!gcfs_rq || !gcfs_rq->load.weight) return; - if (throttled_hierarchy(gcfs_rq)) - return; - shares = calc_group_shares(gcfs_rq); if (unlikely(se->load.weight != shares)) reweight_entity(cfs_rq_of(se), se, shares); @@ -5234,6 +5231,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) static void check_enqueue_throttle(struct cfs_rq *cfs_rq); static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq); +static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq); static void requeue_delayed_entity(struct sched_entity *se); @@ -5729,6 +5727,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq) return cfs_bandwidth_used() && cfs_rq->throttled; } +static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq) +{ + return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled; +} + /* check whether cfs_rq, or any parent, is throttled */ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq) { @@ -6721,6 +6724,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq) return 0; } +static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq) +{ + return 0; +} + static inline int throttled_hierarchy(struct cfs_rq *cfs_rq) { return 0; @@ -7074,7 +7082,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) * Bias pick_next to pick a task from this cfs_rq, as * p is sleeping when it is within its sched_slice. */ - if (task_sleep && se && !throttled_hierarchy(cfs_rq)) + if (task_sleep && se) set_next_buddy(se); break; } @@ -8722,15 +8730,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (unlikely(se == pse)) return; - /* - * This is possible from callers such as attach_tasks(), in which we - * unconditionally wakeup_preempt() after an enqueue (which may have - * lead to a throttle). This both saves work and prevents false - * next-buddy nomination below. - */ - if (unlikely(throttled_hierarchy(cfs_rq_of(pse)))) - return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { set_next_buddy(pse); } @@ -13154,10 +13153,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - if (cfs_rq_throttled(cfs_rq)) - return; - - if (!throttled_hierarchy(cfs_rq)) + if (!cfs_rq_pelt_clock_throttled(cfs_rq)) list_add_leaf_cfs_rq(cfs_rq); /* Start to propagate at parent */ @@ -13168,10 +13164,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se) update_load_avg(cfs_rq, se, UPDATE_TG); - if (cfs_rq_throttled(cfs_rq)) - break; - - if (!throttled_hierarchy(cfs_rq)) + if (!cfs_rq_pelt_clock_throttled(cfs_rq)) list_add_leaf_cfs_rq(cfs_rq); } } -- 2.39.5