Re: [PATCH 04/10] sched/fair: Add penalty calculation and application logic

From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Wanpeng Li <kernellwp@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	Wanpeng Li <wanpengli@tencent.com>
Subject: Re: [PATCH 04/10] sched/fair: Add penalty calculation and application logic
Date: Wed, 12 Nov 2025 12:55:23 +0530
Message-ID: <ef7d974e-7fcd-4e30-8a0d-8b97e00478bc@amd.com>
In-Reply-To: <20251110033232.12538-5-kernellwp@gmail.com>

Hello Wanpeng,

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> +/*
> + * Calculate penalty with debounce logic for EEVDF yield deboost.
> + * Computes vruntime penalty based on fairness gap (need) plus granularity,
> + * applies queue-size-based caps to prevent excessive penalties in small queues,
> + * and implements reverse-pair debounce (~300us) to reduce ping-pong effects.
> + * Returns 0 if no penalty needed, otherwise returns clamped penalty value.
> + */
> +static u64 __maybe_unused yield_deboost_calculate_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> +				    struct sched_entity *se_t_lca, struct sched_entity *se_t,
> +				    int nr_queued)
> +{
> +	u64 gran, need, penalty, maxp;
> +	u64 gran_floor;
> +	u64 weighted_need, base;
> +
> +	gran = calc_delta_fair(sysctl_sched_base_slice, se_y_lca);
> +	/* Low-bound safeguard for gran when slice is abnormally small */
> +	gran_floor = calc_delta_fair(sysctl_sched_base_slice >> 1, se_y_lca);
> +	if (gran < gran_floor)

Is this even possible?
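calc_delta_fair() only scales delta by NICE_0_LOAD / se->load.weight (the kernel's __calc_delta() does this as a fixed multiply-and-shift, which never decreases as delta grows), so calc_delta_fair(x) cannot be smaller than calc_delta_fair(x >> 1). A minimal user-space sketch of that argument follows; model_calc_delta_fair() and MODEL_NICE_0_LOAD are hypothetical simplifications using exact division, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical nice-0 weight, without the kernel's 64-bit load scaling. */
#define MODEL_NICE_0_LOAD 1024ULL

/*
 * Simplified model of calc_delta_fair(): scale delta by
 * NICE_0_LOAD / weight. The kernel uses a fixed-point inverse-weight
 * multiply in __calc_delta() instead of this exact division.
 */
static u64 model_calc_delta_fair(u64 delta, u64 weight)
{
	if (weight == MODEL_NICE_0_LOAD)
		return delta;
	return delta * MODEL_NICE_0_LOAD / weight;
}

/* Returns 1 if the patch's "gran < gran_floor" branch would be taken. */
static int gran_floor_clamp_taken(u64 base_slice, u64 weight)
{
	u64 gran = model_calc_delta_fair(base_slice, weight);
	u64 gran_floor = model_calc_delta_fair(base_slice >> 1, weight);

	return gran < gran_floor;
}
```

Since the same weight is used for both calls, the branch is never taken for any weight in this model.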

> +		gran = gran_floor;
> +
> +	need = 0;
> +	if (se_t_lca->vruntime > se_y_lca->vruntime)
> +		need = se_t_lca->vruntime - se_y_lca->vruntime;

So I'm assuming you want the yielding task's vruntime to
cross the target's vruntime simply because one task somewhere
down the hierarchy said so.
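Before any cap is applied, the yielder is charged gran + need + need / 10, and since need is exactly the vruntime gap to the target, any positive extra on top of need puts the yielder past the target. A hypothetical user-space sketch of just that comparison (model_need() and yielder_ends_past_target() are illustrative names, and the maxp caps are ignored):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Mirrors the patch: how much larger the target LCA's vruntime is. */
static u64 model_need(u64 yielder_vruntime, u64 target_vruntime)
{
	if (target_vruntime > yielder_vruntime)
		return target_vruntime - yielder_vruntime;
	return 0;
}

/*
 * Returns 1 if charging the yielder need plus @extra (in the patch,
 * gran and the 10% boost, before capping) gives it a larger vruntime
 * than the target.
 */
static int yielder_ends_past_target(u64 yielder_vruntime, u64 target_vruntime,
				    u64 extra)
{
	u64 charged = yielder_vruntime +
		      model_need(yielder_vruntime, target_vruntime) + extra;

	return charged > target_vruntime;
}
```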

> +
> +	/* Apply 10% boost to need when positive (weighted_need = need * 1.10) */
> +	penalty = gran;

So, at the very least, I see it getting a weighted(base_slice / 2) penalty
... 

> +	if (need) {
> +		/* weighted_need = need + 10% */
> +		weighted_need = need + need / 10;
> +		/* clamp to avoid overflow when adding to gran (still capped later) */
> +		if (weighted_need > U64_MAX - penalty)
> +			weighted_need = U64_MAX - penalty;
> +		penalty += weighted_need;

... if not more ...

> +	}
> +
> +	/* Apply debounce via helper to avoid ping-pong */
> +	penalty = yield_deboost_apply_debounce(rq, se_t, penalty, need, gran);

... since, without debounce, the penalty remains the same.
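Setting aside yield_deboost_apply_debounce() (its body is not quoted here), the pre-cap penalty is gran when need is 0 and gran + need + need / 10 otherwise, so it never starts below gran. A sketch of only that arithmetic, with model_raw_penalty() as a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/*
 * Pre-cap penalty as computed in the quoted patch, without the debounce
 * step: start from gran and add need boosted by 10%, applying the
 * patch's clamp on weighted_need before adding it to gran.
 */
static u64 model_raw_penalty(u64 gran, u64 need)
{
	u64 penalty = gran;

	if (need) {
		u64 weighted_need = need + need / 10;

		if (weighted_need > UINT64_MAX - penalty)
			weighted_need = UINT64_MAX - penalty;
		penalty += weighted_need;
	}
	return penalty;
}
```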

> +
> +	/* Upper bound (cap): slightly more aggressive for mid-size queues */
> +	if (nr_queued == 2)
> +		maxp = gran * 6;		/* Strongest push for 2-task ping-pong */
> +	else if (nr_queued == 3)
> +		maxp = gran * 4;		/* 4.0 * gran */
> +	else if (nr_queued <= 6)
> +		maxp = (gran * 5) / 2;		/* 2.5 * gran */
> +	else if (nr_queued <= 8)
> +		maxp = gran * 2;		/* 2.0 * gran */
> +	else if (nr_queued <= 12)
> +		maxp = (gran * 3) / 2;		/* 1.5 * gran */
> +	else
> +		maxp = gran;			/* 1.0 * gran */

And all the nr_queued-based caps count the entities queued, not
"h_nr_queued", so we can have a boatload of tasks waiting to run
above, but because one task decided to call yield_to(), we make them
all starve a little?
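To make this concrete, here is the cap table from the patch as a standalone function; model_maxp() is a hypothetical name, and nr_queued is assumed to be the entity count passed in by the caller. Because it only sees entities, two group entities that between them hold, say, 200 runnable tasks (h_nr_queued == 200) get the most aggressive 6 * gran cap, while 200 tasks queued directly get 1 * gran.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Cap table copied from the quoted patch, keyed on nr_queued only. */
static u64 model_maxp(u64 gran, unsigned int nr_queued)
{
	if (nr_queued == 2)
		return gran * 6;
	if (nr_queued == 3)
		return gran * 4;
	if (nr_queued <= 6)
		return (gran * 5) / 2;
	if (nr_queued <= 8)
		return gran * 2;
	if (nr_queued <= 12)
		return (gran * 3) / 2;
	return gran;
}
```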

> +
> +	if (penalty < gran)
> +		penalty = gran;
> +	if (penalty > maxp)
> +		penalty = maxp;
> +
> +	/* If no need, apply refined baseline push (low risk + mid risk combined). */
> +	if (need == 0) {
> +		/*
> +		 * Baseline multiplier for need==0:
> +		 *   2        -> 1.00 * gran
> +		 *   3        -> 0.9375 * gran
> +		 *   4–6      -> 0.625 * gran
> +		 *   7–8      -> 0.50  * gran
> +		 *   9–12     -> 0.375 * gran
> +		 *   >12      -> 0.25  * gran
> +		 */
> +		base = gran;
> +		if (nr_queued == 3)
> +			base = (gran * 15) / 16;	/* 0.9375 */
> +		else if (nr_queued >= 4 && nr_queued <= 6)
> +			base = (gran * 5) / 8;		/* 0.625 */
> +		else if (nr_queued >= 7 && nr_queued <= 8)
> +			base = gran / 2;		/* 0.5 */
> +		else if (nr_queued >= 9 && nr_queued <= 12)
> +			base = (gran * 3) / 8;		/* 0.375 */
> +		else if (nr_queued > 12)
> +			base = gran / 4;		/* 0.25 */
> +
> +		if (penalty < base)
> +			penalty = base;
> +	}
> +
> +	return penalty;
> +}
> +
> +/*
> + * Apply penalty and update EEVDF fields for scheduler consistency.
> + * Safely applies vruntime penalty with overflow protection, then updates
> + * EEVDF-specific fields (deadline, vlag) and cfs_rq min_vruntime to maintain
> + * scheduler state consistency. Returns true on successful application,
> + * false if penalty cannot be safely applied.
> + */
> +static void __maybe_unused yield_deboost_apply_penalty(struct rq *rq, struct sched_entity *se_y_lca,
> +				 struct cfs_rq *cfs_rq_common, u64 penalty)
> +{
> +	u64 new_vruntime;
> +
> +	/* Overflow protection */
> +	if (se_y_lca->vruntime > (U64_MAX - penalty))
> +		return;
> +
> +	new_vruntime = se_y_lca->vruntime + penalty;
> +
> +	/* Validity check */
> +	if (new_vruntime <= se_y_lca->vruntime)
> +		return;
> +
> +	se_y_lca->vruntime = new_vruntime;
> +	se_y_lca->deadline = se_y_lca->vruntime + calc_delta_fair(se_y_lca->slice, se_y_lca);

And with that we update vruntime to an arbitrary value simply
because one task in the hierarchy decided to call yield_to().

While we are on the topic, you are also missing an update_curr(),
which is only done in yield_task_fair(), so you are actually
looking at a stale vruntime for the yielding entity.
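A minimal model of why the missing update_curr() matters, assuming the yielder is the running entity: its stored vruntime lags by the time it has run since the last update, so the need computed from it comes out larger than the real gap. struct model_se, model_update_curr() and model_need_from() are hypothetical simplifications, not the kernel's struct sched_entity or update_curr().

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical, heavily simplified stand-in for the running entity. */
struct model_se {
	u64 vruntime;	/* value as of the last model_update_curr() */
	u64 exec_start;	/* clock value at the last update */
	u64 weight;	/* load weight; 1024 models nice 0 */
};

/*
 * Simplified update_curr(): charge the time run since exec_start to
 * vruntime, scaled by 1024 / weight, and restart the accounting window.
 */
static void model_update_curr(struct model_se *se, u64 now)
{
	u64 delta_exec = now - se->exec_start;

	se->vruntime += delta_exec * 1024 / se->weight;
	se->exec_start = now;
}

/* The patch's "need", computed from whatever vruntime is stored. */
static u64 model_need_from(const struct model_se *yielder, u64 target_vruntime)
{
	if (target_vruntime > yielder->vruntime)
		return target_vruntime - yielder->vruntime;
	return 0;
}
```

In this model, 3000 ns of unaccounted runtime at nice 0 inflates need from 1000 to 4000 against a target at vruntime 5000.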

> +	se_y_lca->vlag = avg_vruntime(cfs_rq_common) - se_y_lca->vruntime;
> +	update_min_vruntime(cfs_rq_common);
> +}
> +
>  /*
>   * sched_yield() is very simple
>   */
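For reference, the state change made by yield_deboost_apply_penalty() boils down to the arithmetic below. This is a hypothetical user-space model: struct model_yield_se and model_apply_penalty() are illustrative names, the slice is assumed to be already weight-scaled, avg_vruntime is passed in as a plain value rather than read via avg_vruntime(cfs_rq), and update_min_vruntime() is left out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/* Hypothetical subset of sched_entity fields touched by the helper. */
struct model_yield_se {
	u64 vruntime;
	u64 deadline;
	s64 vlag;
	u64 slice;	/* assumed already scaled, i.e. calc_delta_fair(slice) */
};

/*
 * Model of yield_deboost_apply_penalty(): bail out on overflow or a zero
 * penalty (the only case the patch's validity check rejects), otherwise
 * push vruntime forward and rederive deadline and vlag from it.
 * Returns 1 if applied, for testing only; the quoted helper returns void.
 */
static int model_apply_penalty(struct model_yield_se *se, u64 avg_vruntime,
			       u64 penalty)
{
	if (penalty == 0 || se->vruntime > UINT64_MAX - penalty)
		return 0;

	se->vruntime += penalty;
	se->deadline = se->vruntime + se->slice;
	/* u64 difference converted to s64, as in the kernel assignment */
	se->vlag = (s64)(avg_vruntime - se->vruntime);
	return 1;
}
```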

-- 
Thanks and Regards,
Prateek

Thread overview: 36+ messages
2025-11-10  3:32 [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Wanpeng Li
2025-11-10  3:32 ` [PATCH 01/10] sched: Add vCPU debooster infrastructure Wanpeng Li
2025-11-10  3:32 ` [PATCH 02/10] sched/fair: Add rate-limiting and validation helpers Wanpeng Li
2025-11-12  6:40   ` K Prateek Nayak
2025-11-12  6:44     ` K Prateek Nayak
2025-11-13 13:36       ` Wanpeng Li
2025-11-13 12:00     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 03/10] sched/fair: Add cgroup LCA finder for hierarchical yield Wanpeng Li
2025-11-12  6:50   ` K Prateek Nayak
2025-11-13  8:59     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 04/10] sched/fair: Add penalty calculation and application logic Wanpeng Li
2025-11-12  7:25   ` K Prateek Nayak [this message]
2025-11-13 13:25     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 05/10] sched/fair: Wire up yield deboost in yield_to_task_fair() Wanpeng Li
2025-11-10  5:16   ` kernel test robot
2025-11-10  5:16   ` kernel test robot
2025-11-10  3:32 ` [PATCH 06/10] KVM: Fix last_boosted_vcpu index assignment bug Wanpeng Li
2025-11-21  0:35   ` Sean Christopherson
2025-11-21  0:38     ` Sean Christopherson
2025-11-21 11:46     ` Wanpeng Li
2025-11-10  3:32 ` [PATCH 07/10] KVM: x86: Add IPI tracking infrastructure Wanpeng Li
2025-11-10  3:32 ` [PATCH 08/10] KVM: x86/lapic: Integrate IPI tracking with interrupt delivery Wanpeng Li
2025-11-10  3:32 ` [PATCH 09/10] KVM: Implement IPI-aware directed yield candidate selection Wanpeng Li
2025-11-10  3:39 ` [PATCH 10/10] KVM: Relaxed boost as safety net Wanpeng Li
2025-11-10 12:02 ` [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM Christian Borntraeger
2025-11-12  5:01   ` Wanpeng Li
2025-11-18  8:11     ` Christian Borntraeger
2025-11-18 14:19       ` Wanpeng Li
2025-11-11  6:28 ` K Prateek Nayak
2025-11-12  4:54   ` Wanpeng Li
2025-11-12  6:07     ` K Prateek Nayak
2025-11-13  5:37       ` Wanpeng Li
2025-11-13  4:42     ` K Prateek Nayak
2025-11-13  8:33       ` Wanpeng Li
2025-11-13  9:48         ` K Prateek Nayak
2025-11-13 13:56           ` Wanpeng Li
