Re: schedulers and topology exposing questions

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	george.dunlap@eu.citrix.com,
	George Dunlap <george.dunlap@citrix.com>,
	xen-devel@lists.xen.org, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com
Subject: Re: schedulers and topology exposing questions
Date: Wed, 3 Feb 2016 13:05:27 -0500
Message-ID: <20160203180527.GD1069@char.us.oracle.com>
In-Reply-To: <1454413500.9227.97.camel@citrix.com>

On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote:
> On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote:
> > > 
> > > So, may I ask what piece of (Linux) code are we actually talking
> > > about?
> > > Because I had a quick look, and could not find where what you
> > > describe
> > > happens....
> > 
> > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout
> > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT but you can alter the
> > UDP socket to use a different timeout.
> > 
> Ha, recvmsg! At some point you mentioned sendmsg, and I was looking
> there and seeing nothing! But yes, it indeed makes sense to consider
> the receiving side... let me have a look...
> 
> So, it looks to me that this is what happens:
> 
>  udp_recvmsg(noblock=0)
>    |
>    ---> __skb_recv_datagram(flags=0) {
>                 timeo = sock_rcvtimeo(flags=0) /* returns sk->sk_rcvtimeo */
>                 do {...} wait_for_more_packets(timeo);
>                            |
>                            ---> schedule_timeout(timeo)
> 
> So, at least in Linux 4.4, the timeout used is the one defined in
> sk->sk_rcvtimeo, which looks to me to be this (unless I've followed
> some link wrong, which may well be the case):
> 
> http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31
> #define SO_RCVTIMEO     20
> 
> So there looks to be a timeout. But anyways, let's check
> schedule_timeout().
> 
> > And with MAX_SCHEDULE_TIMEOUT, when it eventually calls 'schedule()' it
> > just goes to sleep (HLT) and eventually gets woken up by VIRQ_TIMER.
> > 
> So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
> 
> schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
>     schedule();
>     return;
> }
> 
> If the timeout is anything other than MAX_SCHEDULE_TIMEOUT (but still a
> valid value), the function does:
> 
> schedule_timeout(timeout) {
>     struct timer_list timer;
> 
>     setup_timer_on_stack(&timer);
>     __mod_timer(&timer);
>     schedule();
>     del_singleshot_timer_sync(&timer);
>     destroy_timer_on_stack(&timer);
>     return;
> }
> 
> So, in both cases, it calls schedule() pretty much immediately. And
> when schedule() is called, the calling process (which would be our
> UDP receiver) goes to sleep.
> 
> The difference is that, in the MAX_SCHEDULE_TIMEOUT case, it does not
> arrange for anyone to wake up the thread that is going to sleep. In
> theory, it could even be stuck forever... Of course, this depends on
> whether the receiver thread is on a runqueue or not and, if it's not,
> on whether its state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE,
> etc., and, in practice, it never happens! :-D
> 
> In this case, I think we take the other branch (the one 'with
> timeout'). But even if we took this one, I would expect the
> receiver thread not to be on any runqueue, but rather to be (in either
> interruptible or uninterruptible state) on a blocking list from which
> it is taken out when a packet arrives.
> 
> For anything other than MAX_SCHEDULE_TIMEOUT, all the above is still
> true, but a timer is set before calling schedule() and putting the
> thread to sleep. This means that, if nothing that would wake up the
> thread happens, or if it hasn't happened yet when the timeout
> expires, the thread is woken up by the timer.

Right.
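
For anyone who wants to poke at the 'with timeout' branch from
userspace, here is a minimal sketch (not the benchmark we ran). It
assumes that setting SO_RCVTIMEO with setsockopt() is what ends up in
sk->sk_rcvtimeo; the helper names open_udp_receiver() and
recv_or_errno() are made up for illustration. With nothing sending to
the socket, the blocking recv() should come back after the timeout
with EAGAIN/EWOULDBLOCK instead of sleeping indefinitely:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: open a UDP socket bound to an ephemeral loopback
 * port and give it a receive timeout of timeout_ms milliseconds.
 * Returns the file descriptor, or -1 on failure. */
int open_udp_receiver(long timeout_ms)
{
    assert(timeout_ms > 0);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the kernel pick a port */

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: do one blocking recv(). Returns 0 if a datagram
 * arrived, otherwise the errno value reported by recv(). */
int recv_or_errno(int fd)
{
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    return n < 0 ? errno : 0;
}
```

Without the setsockopt() call the same recv() simply blocks until a
packet shows up, which is the MAX_SCHEDULE_TIMEOUT case discussed
above.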
> 
> And in fact, schedule_timeout() is not a different way of going to
> sleep compared to just calling schedule(). It is the way you go to
> sleep for at most some amount of time... But in all cases, you just
> go to sleep immediately!
> 
> And I am also not sure I see where all that discussion you've had with
> George about IPIs fits into all this... The IPI that will trigger the
> call to schedule() that will actually put back into execution the thread
> that we're putting to sleep here (i.e., the receiver) happens when
> the sender manages to send a packet (actually, when the packet arrives,
> I think) _or_ when the timer expires.

The IPIs were observed when SMT was exposed to the guest. That is because
the Linux scheduler put both applications, udp_sender and udp_receiver,
on the same CPU, which meant that the 'schedule' call would immediately
pick the next application (udp_sender) and schedule it (and send an IPI
to itself to do that).

> 
> The two possible calls to schedule() in schedule_timeout() behave
> exactly in the same way, and I don't think having a timeout or not is
> responsible for any particular behavior.

Correct. The quirk was that if the applications were on separate
CPUs, the "thread [would be] woken up by the timer", while if they
were on the same CPU, the scheduler would pick the next application
on the run-queue (which coincidentally was the UDP sender or receiver).
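
To compare the two cases deliberately rather than leave placement to
the scheduler, one option is to pin udp_sender and udp_receiver by
hand. A rough sketch using sched_setaffinity(); the helper name
pin_to_first_allowed_cpu() is hypothetical, and this is not what the
benchmark itself did:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for cpu_set_t macros and sched_*affinity */
#endif
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to the lowest-numbered CPU
 * that its current affinity mask allows.
 * Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;

        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        assert(CPU_COUNT(&one) == 1);   /* exactly one CPU requested */

        if (sched_setaffinity(0, sizeof(one), &one) != 0)
            return -1;
        return cpu;
    }
    return -1;                          /* empty mask; should not happen */
}
```

Calling this in both processes (assuming they start with the same
affinity mask) gives the same-CPU case; pinning one of them to a
different CPU gives the separate-CPU case.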

> 
> What I think is happening is this: when such a call to schedule()
> (from inside schedule_timeout(), I mean) is made, the receiver task
> just goes to sleep, and another one, perhaps the sender, is executed.
> The sender sends the packet, which arrives before the timeout, and
> the receiver is woken up.

Yes!
> 
> *Here* is where an IPI should or should not happen, depending on where
> our receiver task is going to be executed! And where would that be?
> Well, that depends on the Linux scheduler's load balancer, the behavior
> of which is controlled by scheduling domains flags like BALANCE_FORK,
> BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and
> others, but I think these are the most likely ones to be involved
> here).

Probably.
> 
> So, in summary, where the receiver executes when it wakes up depends
> on the configuration of such flags in the (various) scheduling domain(s).
> Check, for instance, this path:
> 
>   try_to_wake_up() --> select_task_rq() --> select_task_rq_fair()
> 
> The reason why the test 'reacts' to topology changes is that which set
> of flags is used for the various scheduling domains depends on the
> topology at the time the scheduling domains themselves are created and
> configured... So it's quite possible that exposing the SMT topology,
> as opposed to not doing so, makes one of the flags flip in a way that
> makes the benchmark work better.

/me nods.
> 
> If you play with the flags above (or whatever their equivalents were in
> 2.6.39) directly, even without exposing the SMT topology, I'm quite
> sure you would be able to trigger the same behavior.

I did. And that was the work-around - echo 4xyz flag in the SysFS domain
and suddenly things go much faster.
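
To see which of the flags Dario listed are set in a given domain, the
bitmask can be decoded with something like the sketch below. The bit
values are an assumption: they are how I recall the SD_* definitions
in include/linux/sched.h around Linux 4.4, and they have moved between
kernel versions, so check them against the tree you are running. The
function name sd_flags_describe() is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct sd_flag {
    unsigned long bit;
    const char *name;
};

/* ASSUMPTION: bit values as recalled from Linux 4.4; verify them against
 * your own kernel tree. Only the flags named in this thread are listed. */
static const struct sd_flag sd_flags[] = {
    { 0x0004, "SD_BALANCE_EXEC" },
    { 0x0008, "SD_BALANCE_FORK" },
    { 0x0010, "SD_BALANCE_WAKE" },
    { 0x0020, "SD_WAKE_AFFINE" },
    { 0x1000, "SD_PREFER_SIBLING" },
};

/* Write the names of the known flags set in 'flags' into 'out', separated
 * by single spaces. Stops early if the next name would not fit.
 * Returns the number of names written. */
int sd_flags_describe(unsigned long flags, char *out, size_t outlen)
{
    size_t used = 0;
    int count = 0;

    assert(out != NULL);
    if (outlen == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(sd_flags) / sizeof(sd_flags[0]); i++) {
        if (!(flags & sd_flags[i].bit))
            continue;

        int n = snprintf(out + used, outlen - used, "%s%s",
                         count ? " " : "", sd_flags[i].name);
        if (n < 0 || (size_t)n >= outlen - used) {
            out[used] = '\0';           /* drop the truncated name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

On kernels of that era built with CONFIG_SCHED_DEBUG, I believe the
per-domain value to feed into this can be read from
/proc/sys/kernel/sched_domain/cpuN/domainM/flags, but treat that path
as an assumption as well.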
> 
> Regards,
> Dario
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
