From: Peter Zijlstra <peterz@infradead.org>
To: Radu Rendec <radu.rendec@gmail.com>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>
Subject: Re: pick_next_task() picking the wrong task [v4.9.163]
Date: Sat, 23 Mar 2019 11:15:40 +0100
Message-ID: <20190323101540.GC6058@hirez.programming.kicks-ass.net>
In-Reply-To: <CAD5jUk9s6+dXxAGWNWGbNRZ60QUKN0B37R=pYwci=SDOSSFmFg@mail.gmail.com>

On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> Hi Everyone,
> 
> I believe I'm seeing a weird behavior of pick_next_task() where it
> chooses a lower priority task over a higher priority one. The scheduling
> class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> seems to be the optimization at the beginning of the function, where
> fair_sched_class.pick_next_task() is called directly.  I'm running
> v4.9.163, but that piece of code is very similar in recent kernels.
> 
> My use case is quite simple: I have a real-time thread that is woken up
> by a GPIO hardware interrupt. The thread sleeps most of the time in
> poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
> interrupt and the thread being woken up/scheduled is very important for
> the application. Note that I backported my own commit 03c0a9208bb1, so
> the thread is always woken up synchronously from HW interrupt context.
> 
> Most of the time things work as expected, but sometimes the scheduler
> picks kworker and even the idle task before my real-time thread. I used
> the trace infrastructure to figure out what happens and I'm including a
> snippet below (I apologize for the wide lines).

If only they were wide :/ I had to unwrap them myself..

>      <idle>-0     [000] d.h2   161.202970: gpio_sysfs_irq  <-__handle_irq_event_percpu
>      <idle>-0     [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
>      <idle>-0     [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
>      <idle>-0     [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000

Weird how the next line doesn't have 'n/N' (need_resched) set:

>      <idle>-0     [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
>      <idle>-0     [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
>      <idle>-0     [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
>      <idle>-0     [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000

There's that kworker wakeup.

>      <idle>-0     [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120

And I agree that that is weird.

> kworker/0:1-134   [000] ....   161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
> kworker/0:1-134   [000] ...1   161.203286: schedule <-worker_thread
> kworker/0:1-134   [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
>      <idle>-0     [000] .n.1   161.230287: schedule <-schedule_preempt_disabled

Only here do I see 'n'.
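
For readers less used to the trace output, here is a rough sketch of how the
four-character flags column (e.g. "d.h5" or ".n.1") can be read. It is a toy
userspace decoder, not kernel code: the struct and function names are made up
for illustration, and it only handles the characters that show up in the
trace above (irqs-off, need-resched, hardirq, preempt depth).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the decoded flags; not a kernel type. */
struct toy_trace_flags {
	bool irqs_off;      /* column 1: 'd' means interrupts are disabled */
	bool need_resched;  /* column 2: 'n' or 'N' means a reschedule is pending */
	bool in_hardirq;    /* column 3: 'h' or 'H' means hard interrupt context */
	int preempt_depth;  /* column 4: preempt_disable() nesting, '.' means 0 */
};

/* Decode a four-character flags string as printed by the function tracer. */
static struct toy_trace_flags toy_decode_flags(const char *f)
{
	struct toy_trace_flags out = { false, false, false, 0 };

	assert(strlen(f) == 4);
	out.irqs_off = (f[0] == 'd');
	out.need_resched = (f[1] == 'n' || f[1] == 'N');
	out.in_hardirq = (f[2] == 'h' || f[2] == 'H');
	if (f[3] >= '0' && f[3] <= '9')
		out.preempt_depth = f[3] - '0';
	return out;
}
```

Fed "d.h5" (the sched_wakeup lines) it reports hardirq context with no
reschedule pending; fed ".n.1" (the later schedule line) it reports the
pending reschedule.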

>      <idle>-0     [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
>   irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule
> 
> The system is Freescale MPC8378 (PowerPC, single processor).
> 
> I instrumented pick_next_task() with trace_printk() and I am sure that
> every time the wrong task is picked, flow goes through the optimization

That's weird, because when you wake an RT task, the:

  rq->nr_running == rq->cfs.h_nr_running

condition should not be true. Maybe try adding trace_printk() to every place
that manipulates rq->nr_running to see what goes wobbly?
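
To spell out why the condition should fail, here is a minimal userspace model
of the check. It is only a sketch: struct toy_rq, toy_enqueue() and
toy_fast_path_ok() are invented names, the runqueue is reduced to the two
counters in question, and the prev-class test follows the v4.9 behaviour as
described in the report (fast path only when prev is a fair task).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for scheduling classes; not the kernel's definitions. */
enum toy_class { TOY_RT, TOY_FAIR, TOY_IDLE };

/* Toy runqueue holding only the two counters compared by the fast path. */
struct toy_rq {
	unsigned int nr_running;       /* runnable tasks of every class */
	unsigned int cfs_h_nr_running; /* runnable fair (CFS) tasks only */
};

/* Account a newly woken task: every class bumps nr_running, only fair
 * tasks bump the CFS counter as well. */
static void toy_enqueue(struct toy_rq *rq, enum toy_class cls)
{
	rq->nr_running++;
	if (cls == TOY_FAIR)
		rq->cfs_h_nr_running++;
}

/* True when the shortcut may skip the full iteration over all classes. */
static bool toy_fast_path_ok(const struct toy_rq *rq, enum toy_class prev_cls)
{
	/* The CFS count can never exceed the total if accounting is sane. */
	assert(rq->cfs_h_nr_running <= rq->nr_running);
	return prev_cls == TOY_FAIR &&
	       rq->nr_running == rq->cfs_h_nr_running;
}
```

With one fair task queued the counters match and the shortcut is allowed; as
soon as an RT task is enqueued, nr_running is one larger and the shortcut
must be refused. If it is still taken, one of the counters is off.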

> path and idle_sched_class.pick_next_task() is called directly. When the
> right task is eventually picked, flow goes through the bottom block that
> iterates over all scheduling classes. This probably makes sense: when
> the scheduler runs in the context of the idle task, prev->sched_class is
> no longer fair_sched_class, so the bottom block with the full iteration
> is used. Note that in v4.9.163 the optimization path is taken only when
> prev->sched_class is fair_sched_class, whereas in recent kernels it is
> taken for both fair_sched_class and idle_sched_class.
> 
> Any help or feedback would be much appreciated. In the meantime, I will
> experiment with commenting out the optimization (at the expense of a
> slower scheduler, of course).

It would be very good if you could confirm on the very latest kernel,
instead of on 4.9.
