From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 87CC82CCC0 for ; Thu, 4 Sep 2025 12:04:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.52 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756987456; cv=none; b=bKQ9Z5zUWhKtVTv/ea0J9+z3JCfXI0n76DRNRY+jNaDTY9cTDVZhgKnWe30UnZcf3RhF7zkefaXY1+ItuREweMsufTCKnTvfWf6EDaYR/o9LWip8pSXY56PaZAR3zTp2rK+MYK7pbkojDuog6qzoTucpJYx+sPArogHoGttRjzU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756987456; c=relaxed/simple; bh=lGl2MbNsirxTb2TdNSbAhzazcBr/rp1kOoYDd4ovgYE=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=J5b4f43E2a+chkxvAqzgqqjTMdZSh22+S5vbp+/ItWgkpKQNKgoUUGavF/mrN59h0FWnANbLWqDJNtbduJCTIpKtTY8zQd6NEHFHAGEtSTFUiINBd2QyGFkJq7X5rOiRvD0RHYWtvqUdsRUMiZUPRLAME30vTc8/St6In+JqTgY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=KIXHkON0; arc=none smtp.client-ip=209.85.216.52 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="KIXHkON0" Received: by mail-pj1-f52.google.com with SMTP id 98e67ed59e1d1-329e1c8e079so799089a91.2 for ; Thu, 04 Sep 2025 05:04:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1756987453; x=1757592253; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=IJUjFRTIwREb6ZjdrljHQu9O3h67SQdkCtIrAFpCPbs=; b=KIXHkON0wIIk1p0oAZAnWDkgCpmrpd4lcLFbQZQ35iatLKnY6ug9h9wR5n/hVUTRHo qWC0dqSdAptENB9qVOkEVa7J/Br+fTw4rAPL025oLxXjsHOQVt3+XxSENcUajg51elK0 pC4qHKDOtnzt/drqNs7EyMC7VpRPzlQFfwjQfi8pPUvLL6MUjHI5LzQkvTMDqV4LB0d5 Yu2i9tAQmNKEGrt8DjsCsWxBTr67odORjriM98Z6LFQsBU+6bOd2lswHTXP9Gq3xRGLV GkHUUfTEhXRo8Qc8ozWBQD8Ybg1ZrAqBraRQw/5wWCjdPvc+ki3UyIeouJynxMq+BCjz epJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1756987453; x=1757592253; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=IJUjFRTIwREb6ZjdrljHQu9O3h67SQdkCtIrAFpCPbs=; b=jalRQLq0ChSbLqGK59IvD93sJJKc/x9TzWd4vywnBPqeoWgfW+R3I+H3Y0KfEO+jdN dD0ma35z5/hfIG2xTCVvk5Sw1N+8Tp77W2iBeQmWFVVpKkaBZnF1OSPvvxeTJh56LrD1 9u9Da92tWNbrgNsjKHNGnVP9Vb+nXNFX6HkUiQnwf9F8hUkh8NRW6x+eqCg0l0nOjQMm hY50xAoS0qGO7zMIzE5E9qOWrSqwLnJlXn3VUYqr0cP7Sxi19/HLaB7PVsXQVcL1KkM+ +CcBzu7zTC7W3IgJwwD9RGrcsTF7qZpBo4BxEBjOtdam1weMwuPlRaaEbr6Nsdk0Ukr+ ghzQ== X-Forwarded-Encrypted: i=1; AJvYcCU/aVxqnOsVBZIEN8Hqs+coaI/R/8XJV2w7XYycOTERIMiLrbqUOPDLyjpT4oBR/IyDaM38bWhHjyVUhMM=@vger.kernel.org X-Gm-Message-State: AOJu0Yymqu2fDfTpFAsTQNzqovt0Bzgora5OXCQAZQ+8aHR7hAygG14h g8YmyRslhjCR9l4odmEqzLsJ5CKyg6bF5dAl7E4K6guzqT04NYfxmZ7kqhnsRFG0qA== X-Gm-Gg: ASbGncslmCsi43MWcDGICv6n8q5Ekxrt7or8wbAeKzqeIoTJRYDk7g/vdyqgTdstGYP Wzvsm4no+R0ii0ao4WIfs3YGJwb4yOdMv/KpmbEw4v2+qA8wFmUdoRBf1nNelkZVKcQPzLFKKVc pWaHYKixhz1v75Ilirl/Irs+yoNsZvE3P98AVqFgXh9uNqpFT166RJ0IlMcE0lT16xZ82hXG0pi a0ogNfdvmTvAGSrhA+Y4a1Ny9MBC3qfrji1TRjn/p7+QnUeQnoldU039Ri46Ki+zB7qLPC4RywI DJ1WPxq6MtpIe3bS6NThosSZe6PNbvFm4OKg1ZCBJqL4UGevvIQ2RSap4n+9EcTszqniVoRNh3x Z7eSjLQZl2fJnuvNdYY8K4yspdNS511LwlDeIu7r/gDoDj4IrUw== X-Google-Smtp-Source: AGHT+IH8Tg9q+phA5iI4fVJwpeklHVNbQqKLI4VFDnhH5jlcO/eiU3I6e37RUnTQfdOYSHOwldwpyg== X-Received: by 2002:a17:90b:4d92:b0:327:a638:d21 with SMTP id 98e67ed59e1d1-3281531d3bbmr25209004a91.0.1756987452489; Thu, 04 Sep 2025 05:04:12 -0700 (PDT) Received: from bytedance ([61.213.176.58]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-7726c1ed65bsm8688569b3a.45.2025.09.04.05.04.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Sep 2025 05:04:11 -0700 (PDT) Date: Thu, 4 Sep 2025 20:04:01 +0800 From: Aaron Lu To: Benjamin Segall Cc: K Prateek Nayak , Peter Zijlstra , Valentin Schneider , Chengming Zhou , Josh Don , Ingo Molnar , Vincent Guittot , Xi Wang , linux-kernel@vger.kernel.org, Juri Lelli , Dietmar Eggemann , Steven Rostedt , Mel Gorman , Chuyi Zhou , Jan Kiszka , Florian Bezdeka , Songtang Liu , Chen Yu , Matteo Martelli , Michal Koutn?? , Sebastian Andrzej Siewior Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model Message-ID: <20250904120401.GJ42@bytedance> References: <20250829081120.806-1-ziqianlu@bytedance.com> <20250829081120.806-4-ziqianlu@bytedance.com> <20250903145124.GM4067720@noisy.programming.kicks-ass.net> <14be66aa-e088-4267-ac10-d04d600b1294@amd.com> <20250904081611.GE42@bytedance> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250904081611.GE42@bytedance> On Thu, Sep 04, 2025 at 04:16:11PM +0800, Aaron Lu wrote: > On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote: > > K Prateek Nayak writes: > > > > > Hello Peter, > > > > > > On 9/3/2025 8:21 PM, Peter Zijlstra wrote: > > >>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > >>> { > > >>> + if (task_is_throttled(p)) { > > >>> + dequeue_throttled_task(p, flags); > > >>> + return true; > > >>> + } > > >>> + > > >>> if (!p->se.sched_delayed) > > >>> util_est_dequeue(&rq->cfs, p); > > >>> > > >> > > >> OK, so this makes it so that either a task is fully enqueued (all > > >> cfs_rq's) or full not. A group cfs_rq is only marked throttled when all > > >> its tasks are gone, and unthrottled when a task gets added. Right? > > > > > > cfs_rq (and the hierarchy below) is marked throttled when the quota > > > has elapsed. Tasks on the throttled hierarchies will dequeue > > > themselves completely via task work added during pick. When the last > > > task leaves on a cfs_rq of throttled hierarchy, PELT is frozen for > > > that cfs_rq. > > > > > > When a new task is added on the hierarchy, the PELT is unfrozen and > > > the task becomes runnable. The cfs_rq and the hierarchy is still > > > marked throttled. > > > > > > Unthrottling of hierarchy is only done at distribution. > > > > > >> > > >> But propagate_entity_cfs_rq() is still doing the old thing, and has a > > >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity() > > >> iteration. > > >> > > >> This seems somewhat inconsistent; or am I missing something ? > > > > > > Probably an oversight. But before that, what was the reason to have > > > stopped this propagation at throttled_cfs_rq() before the changes? > > > > > > > Yeah, this was one of the things I was (slowly) looking at - with this > > series we currently still abort in: > > > > 1) update_cfs_group > > 2) dequeue_entities's set_next_buddy > > 3) check_preempt_fair > > 4) yield_to > > 5) propagate_entity_cfs_rq > > > > In the old design on throttle immediately remove the entire cfs_rq, > > freeze time for it, and stop adjusting load. In the new design we still > > pick from it, so we definitely don't want to stop time (and don't). I'm Per my understanding, we keep PELT clock running because we want the throttled cfs_rq's load to continue get update when it still has tasks running in kernel mode and have that up2date load could let it have a hopefully more accurate weight through update_cfs_group(). So it looks to me, if PELT clock should not be stopped, then we should not abort in propagate_entity_cfs_rq() and update_cfs_group(). I missed these two aborts in these two functions, but now you and Peter have pointed this out, I suppose there is no doubt we should not abort in update_cfs_group() and propagate_entity_cfs_rq()? If we should not mess with shares distribution, then the up2date load is not useful and why not simply freeze PELT clock on throttle :) > > guessing we probably also want to now adjust load for it, but it is > > arguable - since all the cfs_rqs for the tg are likely to throttle at the > > same time, so we might not want to mess with the shares distribution, > > since when unthrottle comes around the most likely correct distribution > > is the distribution we had at the time of throttle. > > > > I can give it a test to see how things change by adjusting load and share > distribution using my previous performance tests. > Run hackbench and netperf on AMD Genoa and didn't notice any obvious difference with the cumulated diff. > > Assuming we do want to adjust load for a throttle then we probably want > > to remove the aborts from update_cfs_group and propagate_entity_cfs_rq. > > I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but > > I'm not 100% sure when they are actually doing something in propagate as > > opposed to enqueue. > > > > Yes, commit 0258bdfaff5bd("sched/fair: Fix unfairness caused by missing > load decay") added that list_add_leaf_cfs_rq() in > propagate_entity_cfs_rq() to fix a problem. > > > The other 3 are the same sort of thing - scheduling pick heuristics > > which imo are pretty arbitrary to keep. We can reasonably say that "the > > most likely thing a task in a throttled hierarchy will do is just go > > throttle itself, so we shouldn't buddy it or let it preempt", but it > > would also be reasonable to let them preempt/buddy normally, in case > > they hold locks or such. > > I think we do not need to special case tasks in throttled hierarchy in > check_preempt_wakeup_fair(). > Since there is pros and cons either way and consider the performance test result, I'm now feeling maybe we can leave these 3 as is and revisit them later when there is some clear case. > > > > yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a > > throttle-on-exit kvm cpu thread isn't useful (so no need to remove the > > abort there). The dma code is just yielding to a just-spawned kthread, > > so it should be fine either way. > > Get it. > > The cumulated diff I'm going to experiment is below, let me know if > something is wrong, thanks.