Re: Scheduler regression in 4.7

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dario Faggioli <dario.faggioli@citrix.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>,
	George Dunlap <george.dunlap@citrix.com>,
	George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>,
	Xen-devel List <xen-devel@lists.xen.org>
Subject: Re: Scheduler regression in 4.7
Date: Fri, 12 Aug 2016 05:32:33 +0200	[thread overview]
Message-ID: <1470972753.6250.79.camel@citrix.com> (raw)
In-Reply-To: <bbc5e619-4ef5-16fd-bb4c-782ffd9b0988@citrix.com>


[-- Attachment #1.1: Type: text/plain, Size: 5828 bytes --]

On Thu, 2016-08-11 at 16:42 +0100, Andrew Cooper wrote:
> On 11/08/16 15:28, Dario Faggioli wrote:
> > On Thu, 2016-08-11 at 14:39 +0100, Andrew Cooper wrote:
> > > It will be IS_RUNQ_IDLE() which is the problem.
> > > 
> > Ok, that does one step of list traversing (the runq). What I didn't
> > understand from your report is what crashed when.
> IS_RUNQ_IDLE() was traversing a list, and it encountered an element
> which was being concurrently deleted on a different pcpu.
> 
Yes, I figured it was a race like this, I was asking whether it was
boot, domain creation, domain shutdown, etc (to which you replied
below, so thanks :-) ).

> > AFAICR, during domain destruction we basically move the domain to
> > cpupool0, and without a patch that I sent recently, that is always
> > done
> > as a full fledged cpupool movement, even if the domain is _already_
> > in
> > cpupool0. So, even if you are not using cpupools, and since you
> > mention
> > domain shutdown we probably are looking at 2).
> XenServer doesn't use any cpupools, so all pcpus and vcpus are in
> cpupool0.
> 
Right. But since you say the issue manifests during domain destruction,
this may still be triggered by this callchain:

  domain_kill(d)
    cpupool_move_domain(d, cpupool0)
      cpupool_move_domain_locked()
        sched_move_domain()
          SCHED_OP(insert_vcpu)
            csched_vcpu_insert()
              csched_cpu_pick()

Considering that, as I was saying, without f6bde162c4 (which is not in
4.7), even if the domain is already in cpupool0, sched_move_domain() is
called, and it then calls insert_vcpu(), etc.

Of course, f6bde162c4 is *not* the solution. It will mitigate the issue
in such a way that it won't show up if you don't really use cpupools,
but if there's a race, that still may happen when using cpupools and
destroying a domain in a pool different than cpupool0.

I have to say that, on staging-4.7, I am able to create, reboot and
destroy a domain, without seeing this issue... but again, if it's a
race, this certainly does not mean it's not there!

Also, below, you say that you think we're in domain construction
(which, I agree, is what seems to result from the stack trace).

> It is a vm reboot of a an HVM domU (CentOS 7 64bit, although I doubt
> that is relevant).
> 
> The testcase is vm lifecycle ops on a 32vcpu VM, on a host which
> happens
> to have 32pcpus.
> 
FWIW, I tried 16 vcpus on an host with 16 pcpus (and 16 vcpus dom0).

> > The questions I'm asking above have the aim of figuring out what
> > the
> > status of the runq could be, and why adding a call to
> > csched_cpu_pick()
> > from insert_vcpu() is making things explode...
> It turns out that the stack trace is rather less stack rubble than I
> first thought.  We are in domain construction, and specifically the
> XEN_DOMCTL_max_vcpus hypercall.  All other pcpus are in idle.
> 
>     for ( i = 0; i < max; i++ )
>     {
>             if ( d->vcpu[i] != NULL )
>                 continue;
> 
>             cpu = (i == 0) ?
>                 cpumask_any(online) :
>                 cpumask_cycle(d->vcpu[i-1]->processor, online);
> 
>             if ( alloc_vcpu(d, i, cpu) == NULL )
>                 goto maxvcpu_out;
>     }
> 
> The cpumask_cycle() call is complete and execution has moved into
> alloc_vcpu()
> 
> Unfortunately, none of the code around here spills i or cpu onto the
> stack, so I can't see which values the have from the stack dump.
> 
> However, I see that csched_vcpu_insert() plays with vc->processor,
> which
> surely invalidates the cycle logic behind this loop?
> 
Yes, but I don't think that is a problem. The purpose of calling
csched_cpu_pick() from insert_vcpu() is to find a (the best?) placement
for the vcpu. That will have to be put in v->processor (and the vcpu
queued in the runq of such processor).

But the point is that _csched_vcpu_pick(), in order to come up with a
(potentially) new (and better!) cpu for the vcpu, *reads* v->processor, 
to know where v is now, and take that into account in load balancing
calculations (mostly, idleness/SMT stuff). So, those calls to
cpumask_any() and cpumask_cycle() are useful as they give
csched_cpu_pick() something to start with, and it's ok for it to be
overridden.

But fact is, IS_RUNQ_IDLE() *does* access the runq of that initial cpu
and, as explained here, for Credit1, this happens without holding the
proper spinlock:

    /* This is safe because vc isn't yet being scheduled */
    vc->processor = csched_cpu_pick(ops, vc);

    lock = vcpu_schedule_lock_irq(vc);

    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
        __runq_insert(svc);

    vcpu_schedule_unlock_irq(lock, vc);

Which I think may well explain the race.

So, the solution appears to me to be to move the call to
csched_cpu_pick() inside the critical section. As a matter of fact, for
Credit2 it's already like that (while for RTDS, that may indeed not be
necessary).

I guess the fact that the runq was actually being accessed also in the
Credit1 case, was hidden enough inside IS_RUNQ_IDLE() for both George
not to notice when doing the patch, as well as me when reviewing...
sorry for that.

I'll send a patch for staging (which then will have to be backported).

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

     prev parent reply	other threads:[~2016-08-12  3:32 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-11 11:35 Scheduler regression in 4.7 Andrew Cooper
2016-08-11 13:24 ` George Dunlap
2016-08-11 13:39   ` Andrew Cooper
2016-08-11 14:28     ` Dario Faggioli
2016-08-11 15:42       ` Andrew Cooper
2016-08-12  3:32         ` Dario Faggioli [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1470972753.6250.79.camel@citrix.com \
    --to=dario.faggioli@citrix.com \
    --cc=JBeulich@suse.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=george.dunlap@citrix.com \
    --cc=george.dunlap@eu.citrix.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.