From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f169.google.com (mail-pl1-f169.google.com [209.85.214.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B7981221DAC for ; Mon, 27 Oct 2025 22:33:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.169 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761604406; cv=none; b=DWqTN3T1hkkmpKNP8UhtM/oOL/CTqx94Rr72YVZEgIq7jiJdl+pOM2KXhRC2ourzOLHnAZYhjz7DNqWKUslfrVKiWBzQ2pJ12UiO5k31rXNg6LS29hgN9jPkv/aecPjtqzMJXZlR0YxATPTcWV9j0w6rkwELSvXPwbaM81ldPvM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761604406; c=relaxed/simple; bh=zwrjCNlNufJVlJvBFlxfobQGeB1z+xEbVdASrKcBsTc=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=Dczz0Vymaa5bz9UdWNUto+8tp+YbZ0PaDBFhGYUX3yNzVoicmftnLtZqY+oSmLhuPuyX4cGoXku70QaTvZ4o1UreXqMa11cUWwtvbNET+dd+nKzp/tVfSU2HKj5XR9cMfxIMRXcUfVRxjn3/lSjQzQfBLQnevtJV64Ev02UlEd8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=3GICERYW; arc=none smtp.client-ip=209.85.214.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="3GICERYW" Received: by mail-pl1-f169.google.com with SMTP id d9443c01a7336-27ee41e074dso54166985ad.1 for ; Mon, 27 Oct 2025 15:33:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1761604404; x=1762209204; darn=vger.kernel.org; h=mime-version:user-agent:message-id:date:references:in-reply-to :subject:cc:to:from:from:to:cc:subject:date:message-id:reply-to; bh=HEXV+uraTxvvaKOMTO48dDMDH/SBV1aHMUlVLszCNKE=; b=3GICERYWvnROoqKiVub/+GaU5K5gqhJMfxCkQRjoHSeiUr+bswE+zyZd92JaLkWcaG rvGRpDW+X4iEtHkmQst4BHv/b8CdBvn0kwPMuEI2KL4hrhowSluiyBPq+2JchCmPiJz/ Jy2wcqZPkdXl0J9JGhrORFhvMme5lpTVSahHQR7BLxr6u1G4mIr3fMK7pnJ/wIihSXFS 6cl2NbvJ19yDBiTeUWCiBoXYgXUnrAYFEYVLNig3nzwhRXw0vfsv1XdeUnnXAe9ZQ3rc f7V60+a8ZpDxFSwNHsqx/nM2D31jlgEJIw0E4wssV/4qcq44UKHgn9R8B74pIvMNGk/s afDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761604404; x=1762209204; h=mime-version:user-agent:message-id:date:references:in-reply-to :subject:cc:to:from:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=HEXV+uraTxvvaKOMTO48dDMDH/SBV1aHMUlVLszCNKE=; b=eYNDPuaHpmcR3mBFhecscNtLu0bk+K48W3LTUehI3CFoLdQuSTA78AEefEKUgoI+7+ Gmy17WSgz8ErUUQiljA2pYOZbzZ+3SNposBCYeWxTRQibQnylWW/Wyu507gPSm1y2xfr P/s+35Q4NHihyNbRcj8NGKanT9s+8G3ChoxsOmZRiNuORAwFZk+N9MOa7oPr3PzHgnXn IOH2Fne+ns3JmSH2SytZ6rnz+a2T1pDJzpFqeq305P0B0SYbOvDOIKoYJsfMMVsMTawB xodghxJN7lpCgnJJ+fSQfLCi5HkXOb0j4u0MW2yRtu3Wi3wQjJHQRa0Qvw5+pRZuPiar F7Iw== X-Forwarded-Encrypted: i=1; AJvYcCWgCgBsi19Fa6pGWVUGjGqxwnkh4TfnGfBqIfthMv1PahKYS1Ru1dKCmwsG7M+5kvTdY5qhLwbir7kDuuc=@vger.kernel.org X-Gm-Message-State: AOJu0YzbHBL7B/QSYKb8/xrqzi2vTxyntMntrNpPHMGCh2htxSINdkD1 R2tEbiN8hCTvT/jO2beRLHIaNUV9nT/CZ2cvWwEcHGrrbUjzlO2oB6CBoBuNR7hPOQ== X-Gm-Gg: ASbGncuzYixPsrhhqYm6E3Yr2ayqEWHd5MYGBENj9wYvHlL64zB3IE2Ns000z9jxmeL DCuQnPnXQnCtoxtlm+a5ND4qAsV87KIAOeWRHYa3gpw+Fne+rLGQHsrtQR+yMvjVLU3V0td0CN+ uPUJj8unHl43nIbb6XuLE5Xk8g0Qna5V4mxuw9VwtT9wkIbH8JHx9elcybyngnQpzr2boV8xqeh /IAIrylofA8FdW+GvaHMfI89xTJOIHyuSY2u8a76zmHZd0QBfatXLcLfR4Qv9sbCKTDJPNP0BlI 87qbS4DNJHk5I9HX/y+7cgC2kSQ6up4W7Y6/k9eAlgNb4qllnymSsDkSJPBUc5H1CNR3bdtqeuN yP+yEJGl/qPjkeeBzUA4g0UWS07UDOJypeEgftdrVXG4Ud8Se1MEeAYXbGWKRdTVKNhCcp9Lu8k q29DURIy7TONU5I3YEka5OF5L2G97QVCa6dbUNyc6smMuttA/YPng= X-Google-Smtp-Source: AGHT+IGEW8AEiOYwqGBJFhgBznMYateoc1MVARqrKNcIl/37iCw7vRpLGYq9XeitevU9mf+6X0ByDw== X-Received: by 2002:a17:903:124c:b0:275:2328:5d3e with SMTP id d9443c01a7336-294cb3956cdmr17975895ad.18.1761604403595; Mon, 27 Oct 2025 15:33:23 -0700 (PDT) Received: from bsegall-glaptop.localhost (c-76-132-13-32.hsd1.ca.comcast.net. [76.132.13.32]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-29498e50dd9sm93198475ad.111.2025.10.27.15.33.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 27 Oct 2025 15:33:23 -0700 (PDT) From: Benjamin Segall To: Aaron Lu Cc: Valentin Schneider , K Prateek Nayak , Peter Zijlstra , Hao Jia , Chengming Zhou , Josh Don , Ingo Molnar , Vincent Guittot , Xi Wang , linux-kernel@vger.kernel.org, Juri Lelli , Dietmar Eggemann , Steven Rostedt , Mel Gorman , Chuyi Zhou , Jan Kiszka , Florian Bezdeka , Songtang Liu , Chen Yu , Matteo Martelli , Michal =?utf-8?Q?Koutn=C3=BD?= , Sebastian Andrzej Siewior Subject: Re: [PATCH v2] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining In-Reply-To: <20251023085604.244-1-ziqianlu@bytedance.com> (Aaron Lu's message of "Thu, 23 Oct 2025 16:56:04 +0800") References: <20251023085604.244-1-ziqianlu@bytedance.com> Date: Mon, 27 Oct 2025 15:33:19 -0700 Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain Aaron Lu writes: > When a cfs_rq is to be throttled, its limbo list should be empty and > that's why there is a warn in tg_throttle_down() for non empty > cfs_rq->throttled_limbo_list. > > When running a test with the following hierarchy: > > root > / \ > A* ... > / | \ ... > B > / \ > C* > > where both A and C have quota settings, that warn on non empty limbo list > is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu > part of the cfs_rq for the sake of simpler representation). > > Debug showed it happened like this: > Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), > cfs_rq_c is initialized with runtime_enabled set, runtime_remaining > equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c, > *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task > group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is > called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), > these throttled tasks are directly placed into cfs_rq_c's limbo list by > enqueue_throttled_task(). > > Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these > tasks. The first enqueue triggers check_enqueue_throttle(), and with zero > runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it > can't get more runtime and enters tg_throttle_down(), where the warning > is hit due to remaining tasks in the limbo list. > > I think it's a chaos to trigger throttle on unthrottle path, the status > of a being unthrottled cfs_rq can be in a mixed state at the end, so fix > this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth() immediately > after enabling bandwidth and setting runtime_remaining = 0. This ensures > cfs_rq_c is throttled upfront and cannot enter tg_unthrottle_up() with > zero runtime_remaining. > > Also, update outdated comments in tg_throttle_down() since > unthrottle_cfs_rq() is no longer called with zero runtime_remaining. > > While at it, remove a redundant assignment to se in tg_throttle_down(). > > Fixes: e1fad12dcb66("sched/fair: Switch to task based throttle model") > Signed-off-by: Aaron Lu > Reviewed-by: K Prateek Nayak > --- > v2: add update_rq_clock() before throttle_cfs_rq() as reported by Hao > Jia, or a warn on outdated rq clock is trigged in tg_throttle_down(). > This can happen when user specified a tiny quota. > > Note that Hao Jia also proposed another solution by using a special flag > when doing enqueue_task_fair() in unthrottle path to avoid doing > check_enqueue_throttle() [0]. I think that approach is fine too and it > also has the benefit of not needing to worry about any other potential > cases where a cfs_rq is unthrottled with <=0 runtime_remaining. Thoughts > on which approach to go is welcome, thanks. > [0]: https://lore.kernel.org/lkml/c4a1bcea-fb00-6f3f-6bf6-d876393190e4@gmail.com/ > > kernel/sched/core.c | 11 ++++++++++- > kernel/sched/fair.c | 16 +++++++--------- > kernel/sched/sched.h | 1 + > 3 files changed, 18 insertions(+), 10 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f1ebf67b48e21..58185ec5b8efd 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -9608,7 +9608,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, > cfs_rq->runtime_enabled = runtime_enabled; > cfs_rq->runtime_remaining = 0; > > - if (cfs_rq->throttled) > + /* > + * Throttle cfs_rq now or it can be unthrottled with zero > + * runtime_remaining and gets throttled on its unthrottle path. > + */ > + if (cfs_rq->runtime_enabled && !cfs_rq->throttled) { > + update_rq_clock(rq); > + throttle_cfs_rq(cfs_rq); > + } > + > + if (!cfs_rq->runtime_enabled && cfs_rq->throttled) > unthrottle_cfs_rq(cfs_rq); > } > So if this is the only case it can come up, and it only occurs becasue we set runtime_remaining = 0 and check in unthrottle with <= 0, then I think we should just set runtime_remaining = 1 here. That seems simpler than either throttling immediately (despite likely having plenty of cfs_b->runtime) or adding an enqueue flag. Adding NR_CPUs nanoseconds worth of quota on configure seems fine. unthrottle_cfs_rq/tg_unthrottle_up itself doesn't drop rq lock, so we shouldn't be able to see cfs_rq->runtime_remaining being consumed during it, even if it's running on a remote cpu so that threads in the cfs_rq can be running. They should wind up stuck waiting for rq lock in order to update runtime_remaining. Is there anything you see missing from that approach? I think it doing = 0 in particular here is just an artifact, and while the extra check for runtime_remaining in unthrottle isn't unreasonable, the conflict with tg_set_cfs_bandwidth isn't a fundamental issue.