linux-kernel.vger.kernel.org archive mirror
From: Paul Turner <pjt@google.com>
To: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org, Mike Galbraith <efault@gmx.de>,
	Rik van Riel <riel@redhat.com>
Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path - v1
Date: Mon, 7 Mar 2011 17:29:04 -0800
Message-ID: <AANLkTi=dSdnn0p8973C5-URTLbTkJ2XROV9XePrQezJ3@mail.gmail.com>
In-Reply-To: <1299545997-26304-1-git-send-email-venki@google.com>

On Mon, Mar 7, 2011 at 4:59 PM, Venkatesh Pallipadi <venki@google.com> wrote:
> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> the root and picks the task/taskgroup with the min vruntime across all
> runnable tasks. But when there are many frequently sleeping tasks
> across different taskgroups, it makes better sense to stay with the same taskgroup
> for its slice period (or until all tasks in the taskgroup sleep) instead of
> switching across taskgroups on each sleep after a short runtime.
> This helps specifically where a taskgroup corresponds to a process with
> multiple threads. The change reduces the number of CR3 switches in this case.
>
> Example:
> Two taskgroups with 2 threads each which are running for 2ms and
> sleeping for 1ms. Looking at sched:sched_switch shows -
>
> BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
>      cpu-soaker-5004  [003]  3683.391089
>      cpu-soaker-5016  [003]  3683.393106
>      cpu-soaker-5005  [003]  3683.395119
>      cpu-soaker-5017  [003]  3683.397130
>      cpu-soaker-5004  [003]  3683.399143
>      cpu-soaker-5016  [003]  3683.401155
>      cpu-soaker-5005  [003]  3683.403168
>      cpu-soaker-5017  [003]  3683.405170
>
> AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
>      cpu-soaker-21890 [003]   865.895494
>      cpu-soaker-21935 [003]   865.897506
>      cpu-soaker-21934 [003]   865.899520
>      cpu-soaker-21935 [003]   865.901532
>      cpu-soaker-21934 [003]   865.903543
>      cpu-soaker-21935 [003]   865.905546
>      cpu-soaker-21891 [003]   865.907548
>      cpu-soaker-21890 [003]   865.909560
>      cpu-soaker-21891 [003]   865.911571
>      cpu-soaker-21890 [003]   865.913582
>      cpu-soaker-21891 [003]   865.915594
>      cpu-soaker-21934 [003]   865.917606
>
> A similar problem exists when there are multiple taskgroups and, say, a task A
> preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
> can pick an unrelated task on taskgroup_2. Here it would be better to give some
> preference to task B on pick_next_task.
>
> A simple (maybe extreme case) benchmark I tried was tbench with 2 tbench
> client processes with 2 threads each running on a single CPU. Avg throughput
> across five 50-second runs was -
> BEFORE: 105.84 MB/sec
> AFTER: 112.42 MB/sec
>
> Changes from v0:
> * Always pass task se to set_next_buddy
> * Avoid repeated set_next_buddy in check_preempt_wakeup
> * Minor flag cleanup in dequeue_task_fair
>
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> ---
>  kernel/sched_fair.c |   41 ++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 3a88dee..cbe442e 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1339,6 +1339,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>        hrtick_update(rq);
>  }
>
> +static struct sched_entity *pick_next_taskse_on_cfsrq(struct cfs_rq *cfs_rq)
> +{
> +       struct sched_entity *se;
> +
> +       do {
> +               se = pick_next_entity(cfs_rq);
> +               cfs_rq = group_cfs_rq(se);
> +       } while (cfs_rq);
> +
> +       return se;
> +}
> +

I think the original approach was much cleaner; the notion of a
SCHED_IDLE task is only meaningful relative to its siblings under group
scheduling.

Generalizing the buddies to work on entities, e.g.:

@@ -2137,10 +2180,11 @@ static void set_last_buddy(struct sched_entity *se)

 static void set_next_buddy(struct sched_entity *se)
 {
-       if (likely(task_of(se)->policy != SCHED_IDLE)) {
-               for_each_sched_entity(se)
-                       cfs_rq_of(se)->next = se;
-       }
+       if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+               return;
+
+       for_each_sched_entity(se)
+               cfs_rq_of(se)->next = se;
 }

That avoids all the picking descent and gets us back to that approach.
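
As a rough illustration of the entity-based set_next_buddy() above, here is
a minimal userspace sketch. The struct layouts, the idle_policy field, and
follow_next_buddy() are simplified stand-ins invented for this example; they
are not the kernel's definitions, and the open-coded parent walk only models
what for_each_sched_entity() does with group scheduling enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel structures (assumptions). */
struct sched_entity;

struct cfs_rq {
	struct sched_entity *next;	/* "next buddy" hint for this runqueue */
};

struct sched_entity {
	struct sched_entity *parent;	/* group entity above us, NULL at the top */
	struct cfs_rq *cfs_rq;		/* runqueue this entity is queued on */
	struct cfs_rq *my_q;		/* runqueue owned by a group entity, NULL for a task */
	int idle_policy;		/* models task_of(se)->policy == SCHED_IDLE */
};

static int entity_is_task(const struct sched_entity *se)
{
	return se->my_q == NULL;
}

/* Mark se, and every ancestor group entity, as next buddy on its runqueue. */
static void set_next_buddy(struct sched_entity *se)
{
	assert(se != NULL);

	/* Only a task can carry the idle policy; group entities are always marked. */
	if (entity_is_task(se) && se->idle_policy)
		return;

	for (; se; se = se->parent)
		se->cfs_rq->next = se;
}

/* Follow the buddy hints down from cfs_rq; returns the deepest hinted entity. */
static struct sched_entity *follow_next_buddy(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = NULL;

	while (cfs_rq && cfs_rq->next) {
		se = cfs_rq->next;
		cfs_rq = se->my_q;
	}
	return se;
}
```

In this model a caller that only holds a group entity can mark it directly:
the hint is set on the group's runqueue and on every level above it, with no
need to descend to a task first.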

> +static void set_next_buddy(struct sched_entity *se);
> +
>  /*
>  * The dequeue_task method is called before nr_running is
>  * decreased. We remove the task from the rbtree and
> @@ -1348,14 +1362,25 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>        struct cfs_rq *cfs_rq;
>        struct sched_entity *se = &p->se;
> +       int task_sleep = flags & DEQUEUE_SLEEP;
>
>        for_each_sched_entity(se) {
>                cfs_rq = cfs_rq_of(se);
>                dequeue_entity(cfs_rq, se, flags);
>
>                /* Don't dequeue parent if it has other entities besides us */
> -               if (cfs_rq->load.weight)
> +               if (cfs_rq->load.weight) {
> +                       /*
> +                        * Bias pick_next to pick a task from this cfs_rq, as
> +                        * p is sleeping when it is within its sched_slice.
> +                        */
> +                       if (task_sleep) {
> +                               struct sched_entity *next_se;
> +                               next_se = pick_next_taskse_on_cfsrq(cfs_rq);
> +                               set_next_buddy(next_se);
> +                       }
>                        break;
> +               }
>                flags |= DEQUEUE_SLEEP;
>        }
>
> @@ -1856,12 +1881,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>        struct sched_entity *se = &curr->se, *pse = &p->se;
>        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>        int scale = cfs_rq->nr_running >= sched_nr_latency;
> +       int next_buddy_marked = 0;
>
>        if (unlikely(se == pse))
>                return;
>
> -       if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
> +       if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
>                set_next_buddy(pse);
> +               next_buddy_marked = 1;
> +       }
>
>        /*
>         * We can come here with TIF_NEED_RESCHED already set from new task
> @@ -1887,8 +1915,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>        update_curr(cfs_rq);
>        find_matching_se(&se, &pse);
>        BUG_ON(!pse);
> -       if (wakeup_preempt_entity(se, pse) == 1)
> +       if (wakeup_preempt_entity(se, pse) == 1) {

Can't this just be:

if ((wakeup_preempt_entity(se, pse) == 1) || (scale && !fork))

Or even:

( wakeup || scale ) && !fork

Storing the state seems messy just for the
preempt-with-resched-already-set case (the effective behavioral difference).
With this the other case can be deleted.

> +               /*
> +                * Bias pick_next to pick the task that is
> +                * triggering this preemption.
> +                */
> +               if (!next_buddy_marked)
> +                       set_next_buddy(&p->se);
>                goto preempt;
> +       }
>
>        return;
>
> --
> 1.7.3.1
>
>

Thread overview: 22+ messages
2011-03-01 23:33 [PATCH] sched: next buddy hint on sleep and preempt path Venkatesh Pallipadi
2011-03-02  2:44 ` Rik van Riel
2011-03-02  5:43 ` Paul Turner
2011-03-02  6:47   ` Mike Galbraith
2011-03-02  7:08     ` Paul Turner
2011-03-02  7:40       ` Mike Galbraith
2011-03-02 19:12     ` Venkatesh Pallipadi
2011-03-08  0:59       ` [PATCH] sched: next buddy hint on sleep and preempt path - v1 Venkatesh Pallipadi
2011-03-08  1:29         ` Paul Turner [this message]
2011-03-08  1:47           ` Venkatesh Pallipadi
2011-04-14  1:21             ` [PATCH 0/2] sched: Avoid frequent cross taskgroup switches -v2 Venkatesh Pallipadi
2011-04-14  1:21             ` [PATCH 1/2] sched: Make set_*_buddy work on non-task entity -v2 Venkatesh Pallipadi
2011-04-19 12:05               ` [tip:sched/core] sched: Make set_*_buddy() work on non-task entities tip-bot for Venkatesh Pallipadi
2011-04-14  1:21             ` [PATCH 2/2] sched: next buddy hint on sleep and preempt path -v2 Venkatesh Pallipadi
2011-04-14 10:50               ` Peter Zijlstra
2011-04-14 17:30                 ` Venkatesh Pallipadi
2011-04-15 21:45                   ` Rik van Riel
2011-04-19 12:05                   ` [tip:sched/core] sched: Next buddy hint on sleep and preempt path tip-bot for Venkatesh Pallipadi
2011-03-08  2:33           ` [PATCH] sched: next buddy hint on sleep and preempt path - v1 Venkatesh Pallipadi
2011-03-02 19:22   ` [PATCH] sched: next buddy hint on sleep and preempt path Venkatesh Pallipadi
2011-03-02 10:31 ` Peter Zijlstra
2011-03-02 15:25   ` Mike Galbraith