From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f174.google.com (mail-pf1-f174.google.com [209.85.210.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6D8DB111A8 for ; Mon, 18 Aug 2025 02:50:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.174 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755485438; cv=none; b=VDoyXJjQKIoEYq1H/2pWLYv+L5m/KFM+lpW6CuSFXVJMnvfC9nUaCX229Kp7uDuO4bCPIiqEqSwOYj+yBe5Fz+PNXO0HYrEH8cOhvOsqiLQq1i/cPrmzHfmljWVFBXRa8g6iKIb01UUwKkOc///hjERoePDKy0o/AVu5kVN5ECI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755485438; c=relaxed/simple; bh=PViqDczLcKA7G2xUvtvKSPWvoetbWNeLL/xKTcYaiXY=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=KDnoRj2Lygr43EHmUt0sftcMPLZtQ9/KZIyJpVY4q6xFPMUeJy8ySzzYJZq1qiv3vLphkRXbYyVpjEdPLyxI61byK3WemcdnT6ClcdEwJgIt1LEUOLqX6dhTfIIct6khXKtjTDiv8qjrkCJ/8R6ETALhkBnYW4IvxuaXA6ihni4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=dEJIRUQ9; arc=none smtp.client-ip=209.85.210.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="dEJIRUQ9" Received: by mail-pf1-f174.google.com with SMTP id d2e1a72fcca58-76e44537dccso1406099b3a.1 for ; Sun, 17 Aug 2025 19:50:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1755485436; x=1756090236; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=2VhIyc8lbfDWyD877vXvrvggoqpEeK0c8nmG+yxGdWw=; b=dEJIRUQ9YmQYPiTIMs0rhHVkh073GvAgvpCFL+ls5a91leB5uOhKJc8Z2x0smWpxMj j4gNKDxLflGjl2rvyVZ6gtYzBUNipnmxDNFw/FF+meNrmNCt+kDJiWRSgMMWu20ipEc9 q8fUFv4VTKbe7Tht0Rh1cUWwSlt1CqjFEfV3N251jp1Qc0MUTQ5LV7V6NPWHa0sn+cf5 O1gEJMhEeOc8Z8eYYFziY+Rg1BF2r6nbq9x/EJln9swiXf5fo3fXPcXa+GZBQxt5ShRA Pghy52bH2PJtB+qU8a1bcSMuEckI3XN6m0D3+5AUoeg+mwu02VkVuJdmG1GU1fpTXurR xFOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755485436; x=1756090236; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=2VhIyc8lbfDWyD877vXvrvggoqpEeK0c8nmG+yxGdWw=; b=JgfLgHXKlV2IwYn2oUGp4rD4m/+6H7Xb2cnq98lMyIV3SNxYhvswXctFs1Xi8GtRrZ 925p3KwycWyJIZ8SB+UAHcOEEMpeG9nvqgYS8am2p/I5DWkQQAIlZ2DSKv97D6moyYc4 95/y6uPY84V70b67FjTtNFr5pWfn46gLPTa6jC/kgIUuSRIAAFUJW8DzNIbAVr3L4zjG g4sgujaVSs+MQeZED0l7y5CfMprxK2RhMHtil6AP4Dui9YMJOd5L56+/gwDI7TagxOL+ 3gTlnhhrLccMkv//C5YSudB7EgSmgGYSGQ4OxerBeYdPtt8hfLgQNEY5u7tUY4MkuOM5 vaBw== X-Gm-Message-State: AOJu0YwkILge4Jl6L35Of59ytORJkVFBC+4cYW9GJGoHGqdUCYGKiISk jy0T2CTfFaM8sgaz1g5JCF8JqWzYmZwqVpPP6hQ0IfjfdJYPjhM1bPo0BwCNrOJrbA== X-Gm-Gg: ASbGncuOqJvzAkyPcdiFN7Y0/IeHGC6Bp3LvhRaNi8PGNZ7x7lguxLCtLsJWe1+HrX3 mujvYXUPgbWjyOnfdUDvA7qsjZYJhJ0bTF/WuZPD9vmFdUUiyHoClKV0riOVdXYYzvh9eKd0s/0 P/aj3yJWr+9TYwdKANmSc+Zbfn+icOCYMiYdn5UbefSn/24Kh4FvY10vpzT9RPFNXM8yw+HTbSO +MKuLqxf60N8Q5+1gBaS/bk+9evTnZlIGCawHpVBi+iIqvC735i9c+MD95eiS+8IiSKPEysN1wS jNEH32KevourbP75uOUQNfH/NJUOUIFRhu/z8rS+7fDJQ1td4hK8+SJw8zIIJeo4ZL1prpGpUF+ EHk0Fq3ZpyJIQeD1wbpdMw3k6xUY= X-Google-Smtp-Source: AGHT+IEQe01kbbXSTNLoPxtH6KW3enwJneJDatvTrHCbl0isPhSo7JzYq3iJfhP3PzcTRovZjsLGkw== X-Received: by 2002:a05:6a00:238c:b0:73d:fa54:afb9 with SMTP id d2e1a72fcca58-76e446fcdcbmr15841357b3a.7.1755485435557; Sun, 17 Aug 2025 19:50:35 -0700 (PDT) Received: from bytedance ([139.177.225.239]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-76e45566fafsm5966960b3a.73.2025.08.17.19.50.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 17 Aug 2025 19:50:35 -0700 (PDT) Date: Mon, 18 Aug 2025 10:50:14 +0800 From: Aaron Lu To: "Chen, Yu C" Cc: linux-kernel@vger.kernel.org, Juri Lelli , Dietmar Eggemann , Steven Rostedt , Mel Gorman , Chuyi Zhou , Jan Kiszka , Florian Bezdeka , Songtang Liu , Valentin Schneider , Ben Segall , K Prateek Nayak , Peter Zijlstra , Chengming Zhou , Josh Don , Ingo Molnar , Vincent Guittot , Xi Wang Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model Message-ID: <20250818025014.GA38@bytedance> References: <20250715071658.267-1-ziqianlu@bytedance.com> <20250715071658.267-4-ziqianlu@bytedance.com> <4efdc1a8-b624-4857-93cb-c40da6252983@intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4efdc1a8-b624-4857-93cb-c40da6252983@intel.com> On Sun, Aug 17, 2025 at 04:50:50PM +0800, Chen, Yu C wrote: > On 7/15/2025 3:16 PM, Aaron Lu wrote: > > From: Valentin Schneider > > > > In current throttle model, when a cfs_rq is throttled, its entity will > > be dequeued from cpu's rq, making tasks attached to it not able to run, > > thus achiveing the throttle target. > > > > This has a drawback though: assume a task is a reader of percpu_rwsem > > and is waiting. When it gets woken, it can not run till its task group's > > next period comes, which can be a relatively long time. Waiting writer > > will have to wait longer due to this and it also makes further reader > > build up and eventually trigger task hung. > > > > To improve this situation, change the throttle model to task based, i.e. > > when a cfs_rq is throttled, record its throttled status but do not remove > > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when > > they get picked, add a task work to them so that when they return > > to user, they can be dequeued there. In this way, tasks throttled will > > not hold any kernel resources. And on unthrottle, enqueue back those > > tasks so they can continue to run. > > > > Throttled cfs_rq's PELT clock is handled differently now: previously the > > cfs_rq's PELT clock is stopped once it entered throttled state but since > > now tasks(in kernel mode) can continue to run, change the behaviour to > > stop PELT clock only when the throttled cfs_rq has no tasks left. > > > > Tested-by: K Prateek Nayak > > Suggested-by: Chengming Zhou # tag on pick > > Signed-off-by: Valentin Schneider > > Signed-off-by: Aaron Lu > > --- > > [snip] > > > > @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq) > > { > > struct sched_entity *se; > > struct cfs_rq *cfs_rq; > > + struct task_struct *p; > > + bool throttled; > > again: > > cfs_rq = &rq->cfs; > > if (!cfs_rq->nr_queued) > > return NULL; > > + throttled = false; > > + > > do { > > /* Might not have done put_prev_entity() */ > > if (cfs_rq->curr && cfs_rq->curr->on_rq) > > update_curr(cfs_rq); > > - if (unlikely(check_cfs_rq_runtime(cfs_rq))) > > - goto again; > > + throttled |= check_cfs_rq_runtime(cfs_rq); > > se = pick_next_entity(rq, cfs_rq); > > if (!se) > > @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq) > > cfs_rq = group_cfs_rq(se); > > } while (cfs_rq); > > - return task_of(se); > > + p = task_of(se); > > + if (unlikely(throttled)) > > + task_throttle_setup_work(p); > > + return p; > > } > > Previously, I was wondering if the above change might impact > wakeup latency in some corner cases: If there are many tasks > enqueued on a throttled cfs_rq, the above pick-up mechanism > might return an invalid p repeatedly (where p is dequeued, By invalid, do you mean task that is in a throttled hierarchy? > and a reschedule is triggered in throttle_cfs_rq_work() to > pick the next p; then the new p is found again on a throttled > cfs_rq). Before the above change, the entire cfs_rq's corresponding > sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se(cpu) > Yes this is true and it sounds inefficient, but these newly woken tasks may hold some kernel resources like a reader lock so we really want them to finish their kernel jobs and release that resource before being throttled or it can block/impact other tasks and even cause the whole system to hung. > So I did some tests for this scenario on a Xeon with 6 NUMA nodes and > 384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf > cgroup. The results show that there is not much impact in terms of > wakeup latency (considering the standard deviation). Based on the data > and my understanding, for this series, > > Tested-by: Chen Yu Good to know this and thanks a lot for the test!