From: Konrad Rzeszutek Wilk
Subject: Re: schedulers and topology exposing questions
Date: Wed, 3 Feb 2016 13:05:27 -0500
Message-ID: <20160203180527.GD1069@char.us.oracle.com>
In-Reply-To: <1454413500.9227.97.camel@citrix.com>
To: Dario Faggioli
Cc: Elena Ufimtseva, george.dunlap@eu.citrix.com, George Dunlap,
 xen-devel@lists.xen.org, joao.m.martins@oracle.com, boris.ostrovsky@oracle.com
List-Id: xen-devel@lists.xenproject.org

On Tue, Feb 02, 2016 at 12:45:00PM +0100, Dario Faggioli wrote:
> On Thu, 2016-01-28 at 22:27 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 28, 2016 at 03:10:57PM +0000, Dario Faggioli wrote:
> > >
> > > So, may I ask what piece of (Linux) code are we actually talking
> > > about? Because I had a quick look, and could not find where what
> > > you describe happens....
> >
> > udp_recvmsg->__skb_recv_datagram->sock_rcvtimeo->schedule_timeout
> > The sk_rcvtimeo is MAX_SCHEDULE_TIMEOUT, but you can alter it for the
> > UDP socket by setting a different timeout.
> >
> Ha, recvmsg! At some point you mentioned sendmsg, and I was looking
> there and seeing nothing! But yes, it indeed makes sense to consider
> the receiving side... let me have a look...
>
> So, it looks to me that this is what happens:
>
>  udp_recvmsg(noblock=0)
>    |
>    ---> __skb_recv_datagram(flags=0) {
>             timeo = sock_rcvtimeo(flags=0); /* returns sk->sk_rcvtimeo */
>             do { ... } wait_for_more_packets(timeo);
>                            |
>                            ---> schedule_timeout(timeo)
>
> So, at least in Linux 4.4, the timeout used is the one defined in
> sk->sk_rcvtimeo, which looks to me to be this (unless I've followed
> some link wrong, which can well be the case):
>
> http://lxr.free-electrons.com/source/include/uapi/asm-generic/socket.h#L31
> #define SO_RCVTIMEO     20
>
> So there looks to be a timeout. But anyway, let's check
> schedule_timeout().
>
> > And with MAX_SCHEDULE_TIMEOUT, when it eventually calls 'schedule()' it
> > just goes to sleep (HLT) and eventually gets woken up by VIRQ_TIMER.
> >
> So, if the timeout is MAX_SCHEDULE_TIMEOUT, the function does:
>
> schedule_timeout(MAX_SCHEDULE_TIMEOUT) {
>     schedule();
>     return;
> }
>
> If the timeout is anything other than MAX_SCHEDULE_TIMEOUT (but still a
> valid value), the function does:
>
> schedule_timeout(timeout) {
>     struct timer_list timer;
>
>     setup_timer_on_stack(&timer);
>     __mod_timer(&timer);
>     schedule();
>     del_singleshot_timer_sync(&timer);
>     destroy_timer_on_stack(&timer);
>     return;
> }
>
> So, in both cases, it pretty much calls schedule() just about
> immediately. And when schedule() is called, the calling process --
> which would be our UDP receiver -- goes to sleep.
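
For completeness, this is roughly how an application ends up with a
non-default sk->sk_rcvtimeo in the first place. A minimal userspace
sketch - the port number and the one-second timeout are made-up values,
not anything taken from the actual benchmark:

/* Sketch: a UDP receiver that sets SO_RCVTIMEO, so the kernel-side
 * sock_rcvtimeo()/schedule_timeout() path above runs with a finite
 * timeout instead of MAX_SCHEDULE_TIMEOUT. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* made-up: 1s */
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = htonl(INADDR_ANY),
        .sin_port = htons(9999),                         /* made-up port */
    };
    char buf[2048];
    ssize_t n;

    /* This is what ends up in sk->sk_rcvtimeo for this socket. */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Blocks in udp_recvmsg() -> __skb_recv_datagram(); with the
     * timeout set, schedule_timeout() arms a timer before schedule(). */
    n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("timed out, woken up by the timer\n");
    else if (n >= 0)
        printf("got %zd bytes, woken up by packet arrival\n", n);

    close(fd);
    return 0;
}

Without the setsockopt() the same recvfrom() blocks with
MAX_SCHEDULE_TIMEOUT, i.e. the plain schedule() branch above.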
>
> The difference is that, in the case of MAX_SCHEDULE_TIMEOUT, it does not
> arrange for anyone to wake up the thread that is going to sleep. In
> theory, it could even be stuck forever... Of course, this depends on
> whether the receiver thread is on a runqueue or not, and (in case it's
> not) on whether its state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE,
> etc., and, in practice, it never happens! :-D
>
> In this case, I think we take the other branch (the one 'with
> timeout'). But even if we took this one, I would expect the receiver
> thread not to be on any runqueue, but rather (in either interruptible
> or uninterruptible state) on a blocking list from where it is taken
> out when a packet arrives.
>
> In the case of anything other than MAX_SCHEDULE_TIMEOUT, all of the
> above is still true, but a timer is set before calling schedule() and
> putting the thread to sleep. This means that, in case nothing that
> would wake up that thread happens, or in case it hasn't happened yet
> when the timeout expires, the thread is woken up by the timer.

Right.

>
> And in fact, schedule_timeout() is not a different way, with respect to
> just calling schedule(), of going to sleep. It is the way you go to
> sleep for at most some amount of time... But in all cases, you just go
> to sleep immediately!
>
> And I also am not sure I see where all that discussion you've had with
> George about IPIs fits into all this... The IPI that will trigger the
> call to schedule() that will actually put back into execution the
> thread that we're putting to sleep here (i.e., the receiver) happens
> when the sender manages to send a packet (actually, when the packet
> arrives, I think) _or_ when the timer expires.

The IPIs were observed when SMT was exposed to the guest. That is because
the Linux scheduler put both applications - udp_sender and udp_receiver -
on the same CPU. Which meant that the 'schedule' call would immediately
pick the next application (udp_sender) and schedule it (and send an IPI
to itself to do that).

>
> The two possible calls to schedule() in schedule_timeout() behave
> exactly in the same way, and I don't think having a timeout or not is
> responsible for any particular behavior.

Correct. The quirk was that if the applications were on separate CPUs,
the "thread [would be] woken up by the timer". While if they were on the
same CPU, the scheduler would pick the next application on the run-queue
(which coincidentally was the UDP sender - or receiver).

>
> What I think is happening is this: when such a call to schedule()
> (from inside schedule_timeout(), I mean) is made, the receiver task
> just goes to sleep, and another one, perhaps the sender, is executed.
> The sender sends the packet, which arrives before the timeout, and the
> receiver is woken up.

Yes!

>
> *Here* is where an IPI should or should not happen, depending on where
> our receiver task is going to be executed! And where would that be?
> Well, that depends on the Linux scheduler's load balancer, the behavior
> of which is controlled by scheduling domain flags like BALANCE_FORK,
> BALANCE_EXEC, BALANCE_WAKE, WAKE_AFFINE and PREFER_SIBLINGS (and
> others, but I think these are the most likely ones to be involved
> here).

Probably.

>
> So, in summary, where the receiver executes when it wakes up depends on
> the configuration of such flags in the (various) scheduling domain(s).
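
And that is easy to see from the receiver itself: sched_getcpu() before
and after the blocking receive tells you whether the wakeup left the task
where it slept or migrated it, which is exactly what those flags end up
deciding. A sketch only, not anything from the actual benchmark; 'fd' is
assumed to be a bound UDP socket as in the earlier snippet:

/* Sketch: report which CPU the receiver blocked on and which CPU the
 * wakeup placed it on. Whether the two differ is decided by the wakeup
 * path (select_task_rq_fair() and the scheduling domain flags). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

static void recv_and_report(int fd)
{
    char buf[2048];
    int cpu_before = sched_getcpu();    /* CPU we are about to block on */

    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);

    int cpu_after = sched_getcpu();     /* CPU the wakeup placed us on */
    printf("blocked on CPU %d, resumed on CPU %d (%zd bytes)\n",
           cpu_before, cpu_after, n);
}

Pinning the sender and the receiver to the same CPU, or to different
ones (taskset/sched_setaffinity), while watching that output should
reproduce the two cases above without touching the topology at all.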
> Check, for instance, this path:
>
>   try_to_wake_up() --> select_task_rq() --> select_task_rq_fair()
>
> The reason why the test 'reacts' to topology changes is that which set
> of flags is used for the various scheduling domains is decided at the
> time the scheduling domains themselves are created and configured, and
> that depends on the topology... So it's quite possible that exposing
> the SMT topology, with respect to not doing so, makes one of the flags
> flip in a way which makes the benchmark work better.

/me nods.

>
> If you play with the flags above (or whatever their equivalents were in
> 2.6.39) directly, even without exposing the SMT topology, I'm quite
> sure you would be able to trigger the same behavior.

I did. And that was the work-around - echo the 4xyz flags value into the
sched_domain entry in SysFS and suddenly things go much faster (a rough
sketch of that knob is below, after the quoted mail).

>
> Regards,
> Dario
> --
> <> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
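
P.S. The knob referred to above, for reference - a rough sketch only,
assuming the CONFIG_SCHED_DEBUG sched_domain tunables that kernels of
that era exposed under /proc/sys/kernel/sched_domain/; the exact path,
and especially the flag bit values, differ between kernel versions, so
check include/linux/sched.h of the kernel in question before echoing
anything:

/* Sketch: read (and optionally overwrite) the flags of one scheduling
 * domain - i.e. what "echo <value> > .../flags" does from the shell.
 * Assumes the CONFIG_SCHED_DEBUG layout
 *   /proc/sys/kernel/sched_domain/cpu<N>/domain<M>/flags
 * present in the 2.6.x-4.x era; later kernels moved the (read-only)
 * domain information to debugfs instead. */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *path = "/proc/sys/kernel/sched_domain/cpu0/domain0/flags";
    unsigned long flags;
    FILE *f = fopen(path, "r");

    if (!f || fscanf(f, "%lu", &flags) != 1) {
        perror(path);
        return 1;
    }
    fclose(f);
    printf("current flags: %#lx\n", flags);

    if (argc > 1) {                 /* new value passed on the command line */
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        /* e.g. the current value with the wake-affine bit flipped;
         * which bit that is depends on the kernel version. */
        fprintf(f, "%s\n", argv[1]);
        fclose(f);
    }
    return 0;
}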