Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Benjamin Segall <bsegall@google.com>
To: Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Phil Auld <pauld@redhat.com>,
	Clark Williams <williams@redhat.com>,
	Tomas Glozar <tglozar@redhat.com>
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace
Date: Fri, 01 Dec 2023 14:09:43 -0800	[thread overview]
Message-ID: <xm26o7f93788.fsf@google.com> (raw)
In-Reply-To: <xhsmhfs0maw49.mognet@vschneid-thinkpadt14sgen2i.remote.csb> (Valentin Schneider's message of "Fri, 01 Dec 2023 14:30:14 +0100")

Valentin Schneider <vschneid@redhat.com> writes:

> On 30/11/23 13:26, Benjamin Segall wrote:
>> Valentin Schneider <vschneid@redhat.com> writes:
>>
>> The alternative we've been experimenting with (and still running into
>> other issues that have made it hard to tell if they work) is to still
>> leave the tasks on their cfs_rqs, but instead have two task_timelines or
>> similar per cfs_rq, one of which only has unthrottleable tasks (or
>> partially-throttled child cgroups) on it. Then when picking into a
>> partially-unthrottled cfs_rq you only look at that alternate task_timeline.
>>
>
> IIUC then you don't dequeue the cfs_rq's se, but instead rely on the RB
> tree switch to only have unthrottable tasks running.

Correct.

>
>> This means that we get to skip this per-actually-throttled-task loop:
>>
>>> @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>>>  {
>>>      struct rq *rq = data;
>>>      struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> +	struct sched_entity *se = tg->se[cpu_of(rq)];
>>> +	struct cfs_rq *pcfs_rq = cfs_rq_of(se);
>>> +	long task_delta = 0, idle_task_delta = 0;
>>> +	struct task_struct *p, *tmp;
>>>
>>> +	/*
>>> +	 * Re-enqueue the tasks that have been throttled at this level.
>>> +	 *
>>> +	 * The task count is up-propagated via ->unthrottled_*h_nr_running,
>>> +	 * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>>> +	 * might happen when a cfs_rq still has some tasks enqueued, either still
>>> +	 * making their way to userspace, or freshly migrated to it.
>>> +	 */
>>> +	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>>> +		struct sched_entity *pse = &p->se;
>>> +
>>> +		list_del_init(&p->throttle_node);
>>> +		enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>>> +		task_delta++;
>>> +		idle_task_delta += task_has_idle_policy(p);
>>> +	}
>>
>> The downsides are that you instead have extra operations per
>> enqueue/dequeue/pick (but just an extra list/rbtree operation or check),
>> and that it doesn't do *any* accounting change for a partially dequeued
>> cfs_rq.
>>
>
> I would assume we want to keep the *nr_running as close to reality as
> possible, given their impact on pick_next_task_fair() & load_balance().
>

Yeah, and while it's maybe ok for the longer-period load_balance, it's
definitely sketchy for the shorter term things like select_idle_sibling.
In theory we could duplicate more and more of the accounting... (though
that also then becomes questionable amounts of increased overhead on
enqueue/dequeue).

>> I'm going to try putting together a cleaner variant of our version that
>> works via task_work instead of bracketing every relevant entry point.
>> (That design came from when we were trying instead to only do it for
>> tasks holding actual locks)
>
> Interesting, thank you for sharing! I assume then the motivation for this
> is to reduce latencies caused by throttling lock holders?

Yeah, and then we ran into things like percpu-rwsem where lock/unlock
accounting doesn't work as well, but kept the basic design.

next prev parent reply	other threads:[~2023-12-01 22:10 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-30 16:12 [RFC PATCH 0/2] sched/fair: Delay throttling to kernel exit Valentin Schneider
2023-11-30 16:12 ` [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace Valentin Schneider
2023-11-30 21:26   ` Benjamin Segall
2023-12-01 13:30     ` Valentin Schneider
2023-12-01 22:09       ` Benjamin Segall [this message]
2023-12-07 23:47       ` Benjamin Segall
2023-12-18 18:07         ` Valentin Schneider
2023-12-18 22:34           ` Benjamin Segall
2023-12-19 11:50             ` Valentin Schneider
2023-12-05 18:50   ` kernel test robot
2023-12-05 18:50   ` kernel test robot
2023-12-05 20:53   ` kernel test robot
2023-11-30 16:12 ` [RFC PATCH 2/2] sched/fair: Repurpose cfs_rq_throttled() Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xm26o7f93788.fsf@google.com \
    --to=bsegall@google.com \
    --cc=bristot@redhat.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglozar@redhat.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=williams@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.