From: Konrad Rzeszutek Wilk
Subject: Re: schedulers and topology exposing questions
Date: Wed, 3 Feb 2016 13:05:27 -0500
Message-ID: <20160203180527.GD1069@char.us.oracle.com>
In-Reply-To: <1454413500.9227.97.camel@citrix.com>
To: Dario Faggioli
Cc: Elena Ufimtseva, george.dunlap@eu.citrix.com, George Dunlap,
 xen-devel@lists.xen.org, joao.m.martins@oracle.com, boris.ostrovsky@oracle.com
List-Id: xen-devel@lists.xenproject.org

On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote:
> On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote:
> > >
> > > So, may I ask what piece of (Linux) code are we actually talking
> > > about? Because I had a quick look, and could not find where what
> > > you describe happens....
> >
> > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout
> > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT, but you can alter it for the
> > UDP socket by setting a different timeout.
> >
> Ha, recvmsg! At some point you mentioned sendmsg, and I was looking
> there and seeing nothing! But yes, it indeed makes sense to consider
> the receiving side... let me have a look...
>
> So, it looks to me that this is what happens:
>
>  udp_recvmsg(noblock=0)
>    |
>    ---> __skb_recv_datagram(flags=0) {
>             timeo = sock_rcvtimeo(flags=0); /* returns sk->sk_rcvtimeo */
>             do { ... } wait_for_more_packets(timeo);
>                            |
>                            ---> schedule_timeout(timeo)
>
> So, at least in Linux 4.4, the timeout used is the one defined in
> sk->sk_rcvtimeo, which looks to me to be this (unless I've followed
> some link wrong, which can well be the case):
>
> http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31
> #define SO_RCVTIMEO     20
>
> So there looks to be a timeout. But anyway, let's check
> schedule_timeout().
>
> > And with MAX_SCHEDULE_TIMEOUT, when it eventually calls 'schedule()' it
> > just goes to sleep (HLT) and eventually gets woken up by VIRQ_TIMER.
> >
> So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
>
> schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
>     schedule();
>     return;
> }
>
> If the timeout is anything other than MAX_SCHEDULE_TIMEOUT (but still a
> valid value), the function does:
>
> schedule_timeout(timeout) {
>     struct timer_list timer;
>
>     setup_timer_on_stack(&timer);
>     __mod_timer(&timer);
>     schedule();
>     del_singleshot_timer_sync(&timer);
>     destroy_timer_on_stack(&timer);
>     return;
> }
>
> So, in both cases, it pretty much calls schedule() just about
> immediately. And when schedule() is called, the calling process --
> which would be our UDP receiver -- goes to sleep.
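
For completeness, this is roughly how an application ends up with a
non-default sk->sk_rcvtimeo in the first place. A minimal userspace
sketch - the port number and the one-second timeout are made-up values,
not anything taken from the actual benchmark:

/* Sketch: a UDP receiver that sets SO_RCVTIMEO, so the kernel-side
 * sock_rcvtimeo()/schedule_timeout() path above runs with a finite
 * timeout instead of MAX_SCHEDULE_TIMEOUT. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* made-up: 1s */
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port = htons(9999),                         /* made-up port */
    };
    char buf[2048];
    ssize_t n;

    /* This is what ends up in sk->sk_rcvtimeo for this socket. */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Blocks in udp_recvmsg() -> __skb_recv_datagram(); with the
     * timeout set, schedule_timeout() arms a timer before schedule(). */
    n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("timed out, woken up by the timer\n");
    else if (n >= 0)
        printf("got %zd bytes, woken up by packet arrival\n", n);

    close(fd);
    return 0;
}

Without the setsockopt() the same recvfrom() blocks with
MAX_SCHEDULE_TIMEOUT, i.e. the plain schedule() branch above.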
>
> The difference is that, in the case of MAX_SCHEDULE_TIMEOUT, it does not
> arrange for anyone to wake up the thread that is going to sleep. In
> theory, it could even be stuck forever... Of course, this depends on
> whether the receiver thread is on a runqueue or not, and (in case it's
> not) on whether its state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE,
> etc., and, in practice, it never happens! :-D
>
> In this case, I think we take the other branch (the one 'with
> timeout'). But even if we took this one, I would expect the receiver
> thread not to be on any runqueue, but rather (in either interruptible
> or uninterruptible state) on a blocking list from where it is taken
> out when a packet arrives.
>
> In the case of anything other than MAX_SCHEDULE_TIMEOUT, all of the
> above is still true, but a timer is set before calling schedule() and
> putting the thread to sleep. This means that, in case nothing that
> would wake up that thread happens, or in case it hasn't happened yet
> when the timeout expires, the thread is woken up by the timer.

Right.

>
> And in fact, schedule_timeout() is not a different way, with respect to
> just calling schedule(), of going to sleep. It is the way you go to
> sleep for at most some amount of time... But in all cases, you just go
> to sleep immediately!
>
> And I also am not sure I see where all that discussion you've had with
> George about IPIs fits into all this... The IPI that will trigger the
> call to schedule() that will actually put back into execution the
> thread that we're putting to sleep here (i.e., the receiver) happens
> when the sender manages to send a packet (actually, when the packet
> arrives, I think) _or_ when the timer expires.

The IPIs were observed when SMT was exposed to the guest. That is because
the Linux scheduler put both applications - udp_sender and udp_receiver -
on the same CPU. Which meant that the 'schedule' call would immediately
pick the next application (udp_sender) and schedule it (and send an IPI
to itself to do that).

>
> The two possible calls to schedule() in schedule_timeout() behave
> exactly in the same way, and I don't think having a timeout or not is
> responsible for any particular behavior.

Correct. The quirk was that if the applications were on separate CPUs,
the "thread [would be] woken up by the timer". While if they were on the
same CPU, the scheduler would pick the next application on the run-queue
(which coincidentally was the UDP sender - or receiver).

>
> What I think is happening is this: when such a call to schedule()
> (from inside schedule_timeout(), I mean) is made, the receiver task
> just goes to sleep, and another one, perhaps the sender, is executed.
> The sender sends the packet, which arrives before the timeout, and the
> receiver is woken up.

Yes!

>
> *Here* is where an IPI should or should not happen, depending on where
> our receiver task is going to be executed! And where would that be?
> Well, that depends on the Linux scheduler's load balancer, the behavior
> of which is controlled by scheduling domain flags like BALANCE_FORK,
> BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and
> others, but I think these are the most likely ones to be involved
> here).

Probably.

>
> So, in summary, where the receiver executes when it wakes up depends on
> the configuration of such flags in the (various) scheduling domain(s).
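
And that is easy to see from the receiver itself: sched_getcpu() before
and after the blocking receive tells you whether the wakeup left the task
where it slept or migrated it, which is exactly what those flags end up
deciding. A sketch only, not anything from the actual benchmark; 'fd' is
assumed to be a bound UDP socket as in the earlier snippet:

/* Sketch: report which CPU the receiver blocked on and which CPU the
 * wakeup placed it on. Whether the two differ is decided by the wakeup
 * path (select_task_rq_fair() and the scheduling domain flags). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

static void recv_and_report(int fd)
{
    char buf[2048];
    int cpu_before = sched_getcpu();    /* CPU we are about to block on */

    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);

    int cpu_after = sched_getcpu();     /* CPU the wakeup placed us on */
    printf("blocked on CPU %d, resumed on CPU %d (%zd bytes)\n",
           cpu_before, cpu_after, n);
}

Pinning the sender and the receiver to the same CPU, or to different
ones (taskset/sched_setaffinity), while watching that output should
reproduce the two cases above without touching the topology at all.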
> Check, for instance, this path:
>
>   try_to_wake_up() --> select_task_rq() --> select_task_rq_fair()
>
> The reason why the test 'reacts' to topology changes is that which set
> of flags is used for the various scheduling domains is decided at the
> time the scheduling domains themselves are created and configured, and
> that depends on the topology... So it's quite possible that exposing
> the SMT topology, with respect to not doing so, makes one of the flags
> flip in a way which makes the benchmark work better.

/me nods.

>
> If you play with the flags above (or whatever their equivalents were in
> 2.6.39) directly, even without exposing the SMT topology, I'm quite
> sure you would be able to trigger the same behavior.

I did. And that was the work-around - echo the 4xyz flags value into the
sched_domain entry in SysFS and suddenly things go much faster (a rough
sketch of that knob is below, after the quoted mail).

>
> Regards,
> Dario
> --
> <> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
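
P.S. The knob referred to above, for reference - a rough sketch only,
assuming the CONFIG_SCHED_DEBUG sched_domain tunables that kernels of
that era exposed under /proc/sys/kernel/sched_domain/; the exact path,
and especially the flag bit values, differ between kernel versions, so
check include/linux/sched.h of the kernel in question before echoing
anything:

/* Sketch: read (and optionally overwrite) the flags of one scheduling
 * domain - i.e. what "echo <value> > .../flags" does from the shell.
 * Assumes the CONFIG_SCHED_DEBUG layout
 *   /proc/sys/kernel/sched_domain/cpu<N>/domain<M>/flags
 * present in the 2.6.x-4.x era; later kernels moved the (read-only)
 * domain information to debugfs instead. */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *path = "/proc/sys/kernel/sched_domain/cpu0/domain0/flags";
    unsigned long flags;
    FILE *f = fopen(path, "r");

    if (!f || fscanf(f, "%lu", &flags) != 1) {
        perror(path);
        return 1;
    }
    fclose(f);
    printf("current flags: %#lx\n", flags);

    if (argc > 1) {                 /* new value passed on the command line */
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        /* e.g. the current value with the wake-affine bit flipped;
         * which bit that is depends on the kernel version. */
        fprintf(f, "%s\n", argv[1]);
        fclose(f);
    }
    return 0;
}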