From: Tianyang Chen <tiche@seas.upenn.edu>
To: Meng Xu <xumengpanda@gmail.com>,
	Dario Faggioli <dario.faggioli@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
	Meng Xu <mengxu@cis.upenn.edu>,
	Dagaen Golomb <dgolomb@seas.upenn.edu>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [PATCH V2 1/1] Improved RTDS scheduler
Date: Fri, 29 Jan 2016 17:40:23 -0500	[thread overview]
Message-ID: <56ABEA57.9090904@seas.upenn.edu> (raw)
In-Reply-To: <CAENZ-+ki9piRdvWnUEJLu9h95ycoKM5MwD_ORmnAzh2DNzSGvQ@mail.gmail.com>



On 1/26/2016 12:28 PM, Meng Xu wrote:
> Hi Dario and Tianyang,
>
> On Tue, Jan 26, 2016 at 9:06 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> On Mon, 2016-01-25 at 18:00 -0500, Meng Xu wrote:
>>> On Mon, Jan 25, 2016 at 5:04 PM, Tianyang Chen <tiche@seas.upenn.edu>
>>> wrote:
>>>> I have removed some of the Ccs so they won't get bothered as we
>>>> discussed
>>>> previously.
>>>>
>>>> On 1/25/2016 4:00 AM, Dario Faggioli wrote:
>>>>>
>>>>> On Thu, 2015-12-31 at 05:20 -0500, Tianyang Chen wrote:
>>>>>>
>>>>>>
>>>>>> @@ -147,6 +148,16 @@ static unsigned int nr_rt_ops;
>>>>>>     * Global lock is referenced by schedule_data.schedule_lock
>>>>>> from all
>>>>>>     * physical cpus. It can be grabbed via
>>>>>> vcpu_schedule_lock_irq()
>>>>>>     */
>>>>>> +
>>>>>> +/* dedicated timer for replenishment */
>>>>>> +static struct timer repl_timer;
>>>>>> +
>>>>>
>>>>> So, there's always only one timer... Even if we have multiple
>>>>> cpupools
>>>>> with RTDS as their scheduler, they share the replenishment timer?
>>>>> I
>>>>> think it makes more sense to make this per-scheduler.
>>>>>
>>>> Yeah, I totally ignored the case for cpu-pools. It looks like when
>>>> a
>>>> cpu-pool is created, it copies the scheduler struct and calls
>>>> rt_init()
>>>> where a private field is initialized. So I assume the timer should
>>>> be put
>>>> inside the scheduler private struct? Now that I think about it, the
>>>> timer is
>>>> hard-coded to run on cpu0. If there're lots of cpu-pools but the
>>>> replenishment can only be done on the same pcpu, would that be a
>>>> problem?
>>>> Should we keep track of all instances of schedulers (nr_rt_ops
>>>> counts how
>>>> many) and just put timers on different pcpus?
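
Thinking out loud: if we go per-scheduler, I imagine it would look roughly
like the sketch below, i.e. moving repl_timer into rt_private and
initialising it in rt_init(). This is only a rough, untested sketch; the
rt_private layout shown here is indicative, and repl_handler is just a
placeholder name for the timer handler, not code from any posted version.

/* Sketch only: per-instance replenishment timer in the private data. */
struct rt_private {
    spinlock_t lock;            /* global lock for this scheduler instance */
    struct list_head sdom;      /* list of rt_dom */
    struct list_head runq;      /* runnable vcpus, ordered by cur_deadline */
    struct list_head depletedq; /* vcpus that ran out of budget */
    struct timer repl_timer;    /* replenishment timer for this instance */
    cpumask_t tickled;
};

static void repl_handler(void *data); /* timer handler, defined elsewhere */

static int
rt_init(struct scheduler *ops)
{
    struct rt_private *prv = xzalloc(struct rt_private);

    if ( prv == NULL )
        return -ENOMEM;

    spin_lock_init(&prv->lock);
    INIT_LIST_HEAD(&prv->sdom);
    INIT_LIST_HEAD(&prv->runq);
    INIT_LIST_HEAD(&prv->depletedq);

    /*
     * Bind the timer to whatever pcpu we happen to be on, instead of a
     * hard-coded cpu 0; it could be re-bound later (e.g. from
     * rt_alloc_pdata()/rt_free_pdata()) if that pcpu leaves the pool.
     */
    init_timer(&prv->repl_timer, repl_handler, (void *)ops,
               smp_processor_id());

    ops->sched_data = prv;
    return 0;
}
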
>>>>
>>>>>> +/* controls when to first start the timer*/
>>>>>> +static int timer_started;
>>>>>> +
>>>>>
>>>>> I don't like this, and I don't think we need it. In fact, you
>>>>> removed
>>>>> it yourself from v3, AFAICT.
>>>>>
>>>>>> @@ -635,6 +652,13 @@ rt_vcpu_insert(const struct scheduler
>>>>>> *ops,
>>>>>> struct vcpu *vc)
>>>>>>
>>>>>>        /* add rt_vcpu svc to scheduler-specific vcpu list of the
>>>>>> dom */
>>>>>>        list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);
>>>>>> +
>>>>>> +    if(!timer_started)
>>>>>> +    {
>>>>>> +        /* the first vcpu starts the timer for the first
>>>>>> time*/
>>>>>> +        timer_started = 1;
>>>>>> +        set_timer(&repl_timer,svc->cur_deadline);
>>>>>> +    }
>>>>>>    }
>>>>>>
>>>>> This also seems to be gone in v3, which is good. In fact, it uses
>>>>> timer_started, which I already said I didn't like.
>>>>>
>>>>> About the actual startup of the timer (no matter whether for
>>>>> first time
>>>>> or not). Here, you were doing it in _vcpu_insert() and not in
>>>>> _vcpu_wake(); in v3 you're doing it in _vcpu_wake() and not in
>>>>> _runq_insert()... Which one is the proper way?
>>>>>
>>>>
>>>> Correct me if I'm wrong, but at the beginning of the boot process, all
>>>> vcpus are
>>>> put to sleep/not_runnable after insertions. Therefore, the timer
>>>> should
>>>> start when the first vcpu wakes up. I think the wake() in v3 should
>>>> be
>>>> correct.
>>>>
>>>>
>>>>>> @@ -792,44 +816,6 @@ __runq_pick(const struct scheduler *ops,
>>>>>> const
>>>>>> cpumask_t *mask)
>>>>>>    }
>>>>>>
>>>>>>    /*
>>>>>> - * Update vcpu's budget and
>>>>>> - * sort runq by insert the modifed vcpu back to runq
>>>>>> - * lock is grabbed before calling this function
>>>>>> - */
>>>>>> -static void
>>>>>> -__repl_update(const struct scheduler *ops, s_time_t now)
>>>>>> -{
>>>>>>
>>>>> Please, allow me to say that seeing this function going away,
>>>>> fills my
>>>>> heart with pure joy!! :-D
>>>>>
>>>>>> @@ -889,7 +874,7 @@ rt_schedule(const struct scheduler *ops,
>>>>>> s_time_t
>>>>>> now, bool_t tasklet_work_sched
>>>>>>            }
>>>>>>        }
>>>>>>
>>>>>> -    ret.time = MIN(snext->budget, MAX_SCHEDULE); /* sched
>>>>>> quantum */
>>>>>> +    ret.time = snext->budget; /* invoke the scheduler next
>>>>>> time */
>>>>>>        ret.task = snext->vcpu;
>>>>>>
>>>>> This is ok as it is done in v3 (i.e., snext->budget if !idle, -1
>>>>> if
>>>>> idle).
>>>>>
>>>>>> @@ -1074,14 +1055,7 @@ rt_vcpu_wake(const struct scheduler
>>>>>> *ops,
>>>>>> struct vcpu *vc)
>>>>>>        /* insert svc to runq/depletedq because svc is not in
>>>>>> queue now
>>>>>> */
>>>>>>        __runq_insert(ops, svc);
>>>>>>
>>>>>> -    __repl_update(ops, now);
>>>>>> -
>>>>>> -    ASSERT(!list_empty(&prv->sdom));
>>>>>> -    sdom = list_entry(prv->sdom.next, struct rt_dom,
>>>>>> sdom_elem);
>>>>>> -    online = cpupool_scheduler_cpumask(sdom->dom->cpupool);
>>>>>> -    snext = __runq_pick(ops, online); /* pick snext from ALL
>>>>>> valid
>>>>>> cpus */
>>>>>> -
>>>>>> -    runq_tickle(ops, snext);
>>>>>> +    runq_tickle(ops, svc);
>>>>>>
>>>>> And this is another thing I especially like of this patch: it
>>>>> makes the
>>>>> wakeup path a lot simpler and a lot more similar to how it looks
>>>>> like
>>>>> in the other schedulers.
>>>>>
>>>>> Good job with this. :-)
>>>>>
>>>>>> @@ -1108,15 +1078,8 @@ rt_context_saved(const struct scheduler
>>>>>> *ops,
>>>>>> struct vcpu *vc)
>>>>>>        if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc-
>>>>>>> flags) &&
>>>>>>             likely(vcpu_runnable(vc)) )
>>>>>>        {
>>>>>> +        /* only insert the pre-empted vcpu back*/
>>>>>>            __runq_insert(ops, svc);
>>>>>> -        __repl_update(ops, NOW());
>>>>>> -
>>>>>> -        ASSERT(!list_empty(&prv->sdom));
>>>>>> -        sdom = list_entry(prv->sdom.next, struct rt_dom,
>>>>>> sdom_elem);
>>>>>> -        online = cpupool_scheduler_cpumask(sdom->dom-
>>>>>>> cpupool);
>>>>>> -        snext = __runq_pick(ops, online); /* pick snext from
>>>>>> ALL
>>>>>> cpus */
>>>>>> -
>>>>>> -        runq_tickle(ops, snext);
>>>>>>        }
>>>>>
>>>>> Mmm... I'll think about this more and let you know... But out of
>>>>> the
>>>>> top of my head, I think the tickling has to stay? You preempted a
>>>>> vcpu
>>>>> from the pcpu where it was running, maybe some other pcpu is
>>>>> either
>>>>> idle or running a vcpu with a later deadline, and should come and
>>>>> pick
>>>>> this one up?
>>>
>>> If that's the case, why should we preempt this VCPU? We should use
>>> the
>>> top VCPU in the runq to preempt the lowest-priority VCPU, right?
>>>
>> Yeah, in theory, and as far as things go by "just" looking at
>> runq_tickle() in sched_rt.c, you're right.
>>
>> Tickling is (like everything in life :-P) not perfect, though. At least
>> it's not "instantaneously" that a tickled pcpu comes and picks up work...
>> Something may happen that perturbs the scenario one depicted in one's
>> head when thinking about what will happen after tickling (e.g., the pcpu
>> being tickled may be doing something which can't be interrupted for a
>> while).
>>
>> The scheduler should be robust, and not explode or exhibit
>> wrong behavior in such cases, and I think the current code is ok in that
>> sense.
>>
>> My idea was that adding one more "tickling point" would make it more
>> robust, exactly in that regard, i.e., in tolerating anomalies where
>> tickling results in the pcpu that picks up the work being a different
>> one from the one we expected.
>>
>> But this indeed introduces more overhead... So, I agree, let's not do
>> that and see if we encounter problems. We'll come back here if we do.
>> :-)
>
> I see the point. OK. We can do some tests to see when the extra
> tickling point would be used, by adding a temporary warning print in
> the code and checking whether it gets triggered and whether we can
> avoid it. If it's not triggered too frequently, it may be OK (though
> I'm not so sure) to use the extra tickle for it. (Maybe have some fast
> path in the tickle for the common cases.)
>
>>
>>>> gEDF allows this, but there is overhead and it may not be worth it. I
>>>> have no
>>>> stats to support this but there are some papers on restricting what
>>>> tasks
>>>> can migrate. We can discuss more if we need extra logic here.
>>>
>>> As to gEDF, the scheduling policy does not restrict what tasks can
>>> migrate, except for the VCPU's hard affinity set by users.  So
>>> migrating is an option, but we should avoid unnecessary
>>> preemption/migration.
>>>
>> Agreed.
>>
>>>>>> +        if( min_repl> svc->cur_deadline )
>>>>>> +        {
>>>>>> +            min_repl = svc->cur_deadline;
>>>>>> +        }
>>>>>> +        /* reinsert the vcpu if its deadline is updated */
>>>>>> +        __q_remove(svc);
>>>>>> +        __runq_insert(ops, svc);
>>>>>>
>>>>> One more proof of what I was trying to say. Is it really this
>>>>> handler's
>>>>> job to --basically-- re-sort the runqueue? I don't think so.
>>>>>
>>>>> What is the specific situation that you are trying to handle like
>>>>> this?
>>>>>
>>>> Right, if we want to count deadline misses, it could be done when a
>>>> vcpu is
>>>> picked. However, when selecting the most imminent "inter-release
>>>> time" of
>>>> all runnable vcpus, the head of the runq could be missing its
>>>> deadline and
>>>> the cur-deadline could be in the past. How do we handle this
>>>> situation? We
>>>> still need to scan the runq, right?
>>>
>>> I think Dario's point is that:
>>> All VCPUs on the runq should still have remaining budget, so they should
>>> not be in the situation of needing their budget replenished. So in
>>> the replenishment handler, the runq should not need to be scanned.
>>>
>> Exactly.
>>
>>> I agree the runq
>>> should not be scanned to replenish budgets. But the top VCPU in the
>>> runq may be the next earliest one that should get its budget
>>> replenished.
>>>
>> With "the top VCPU in the runq" you mean the one that is running on a
>> pCPU? No, I don't think you refer to that one since, as you said,
>> running vCPUs are not in the runqueues.
>>
>> And then, why is it only the first vCPU in the runqueue that you seem to
>> care about? I appreciate it has the shortest deadline, but I can't tell
>> if that's enough to assume we only need to care about it. I probably do
>> not recall with 100% accuracy the details of the DS algorithm that you
>> want to implement... In particular, whether or not a replenishment needs
>> to be programmed when the task/vcpu starts running, so do feel free to
>> summarize it here for me. :-)
>
> Ah, you are right. I forgot that a VCPU with a large deadline may have
> the earliest replenish time in the future. My fault. :-( The replenish
> time of the vcpu should be decided once the VCPU starts running, or once
> the VCPU wakes up, or once the VCPU gets its current budget in the
> current period. So the top VCPU in the runq actually should have its
> replenish time decided when/before it's added into the runq, since it
> has to have some budget to stay in the runq.
>
> I think you are correct and I like the design you describe later that
> uses a separate "queue" to record the replenish time.
>
>>
>> In any case, whatever the algorithm says, what I'm proposing is
>> something completely different and general enough that would,
>> hopefully, make it easier to reason about things.
>>
>> Can we have, together with the timer, a list of replenishment events?
>
> Sure! This is a good idea, and it will make it easier for us to include
> RM scheduling later, if needed. Reusing the runq and depletedq would
> mean that adding the RM scheduling policy requires many changes to the
> RTDS scheduling framework.
>
>> I'm talking about an actual list of entries, where each entry contains
>> the following information:
>>   - time at which the replenishment should happen
>>   - amount of replenishment
>>   - vcpu to be replenished
>>
>> The list will be kept in order of replenishment time, and the timer
>> will always be programmed to fire at the most imminent replenishment
>> time instant (which will correspond to the first entry in the list).
>>
>> When the timer fires, it picks up the first entry, takes it out of the
>> list, does the replenishment and takes care of the effects of the
>> replenishment itself (deadline update, preemption, runq re-insertion,
>> moving from depletedq to runq), depending on what the status of the
>> vcpu being replenished was.
>>
>> After doing all this, the timer reprograms itself to fire at the
>> replenishment time of the next (which just became the new first) entry
>> in the list.
>>
>> At least, this is the idea. Now, for the implementation:
>>
>>   1. instead of only checking the first entry, it's wise that the timer
>>      handler walks the list and, considering the current time
>>      (what NOW() gives you), takes care of all entries whose
>>      replenishment time has passed. This is to deal with cases where
>>      the handler may fire a little bit later than expected, and more
>>      than just one replenishment event should have occurred already.
>>      Since the list is in order, it is ok to stop as soon as the first
>>      entry with a replenishment time which is actually in the future is
>>      found;
>>
>>   2. embedded lists give us the opportunity to place one data structure
>>      in more than one list/queue. So, instead of actually dynamically
>>      allocating and using a dedicated data structure for each
>>      replenishment event, I think we can:
>>       - just add another embedded list element in rt_vcpu (just like
>>         q_elem),
>>       - use that to queue the rt_vcpus in the replenishment events list
>>       - add the fields necessary to handle replenishment directly in
>>         rt_vcpu (assuming we need any... If the next replenishment time is
>>         the next absolute deadline and the amount is the budget, everything
>>         we need should already be there)
>>
>> So.. What do you think?
>
> It's nice. Thank you so much for drafting this. :-)
> What do you think, Tianyang?
>
Dario and Meng,

This indeed makes it very clear, and it looks like the timer handler 
doesn't even need to know where a vcpu is at all: just plain 
replenishment when it needs it. It also gets rid of the extra code in 
context_saved().

I'm assuming a vcpu should be added to the replq when it wakes up and 
removed when it's not runnable (checks in timer handler).
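
To make sure I'm on the same page, here is roughly what I have in mind
for the replq. This is only a sketch, untested; replq_elem, replq_insert,
repl_handler and the prv->replq / prv->repl_timer fields are illustrative
names of my own, not code from any posted version, and locking follows
the usual prv->lock.

/* rt_vcpu gets a second embedded list element, so the same vcpu can sit
 * on the runq/depletedq and on the replenishment queue at the same time. */
struct rt_vcpu {
    struct list_head q_elem;      /* on runq or depletedq */
    struct list_head replq_elem;  /* on the replenishment queue */
    s_time_t period;
    s_time_t budget;
    s_time_t cur_budget;
    s_time_t cur_deadline;        /* doubles as next replenishment time */
    struct vcpu *vcpu;
    /* ... unchanged fields ... */
};

/* Insert svc into the replq, kept sorted by replenishment time. */
static void
replq_insert(struct list_head *replq, struct rt_vcpu *svc)
{
    struct list_head *iter;

    list_for_each ( iter, replq )
    {
        struct rt_vcpu *cur = list_entry(iter, struct rt_vcpu, replq_elem);

        if ( svc->cur_deadline < cur->cur_deadline )
            break;
    }
    list_add_tail(&svc->replq_elem, iter);
}

/* Timer handler: serve every replenishment whose time has already passed. */
static void
repl_handler(void *data)
{
    const struct scheduler *ops = data;
    struct rt_private *prv = rt_priv(ops);
    s_time_t now = NOW();
    struct list_head *iter, *tmp;

    spin_lock_irq(&prv->lock);

    list_for_each_safe ( iter, tmp, &prv->replq )
    {
        struct rt_vcpu *svc = list_entry(iter, struct rt_vcpu, replq_elem);

        if ( now < svc->cur_deadline )
            break;                      /* list is sorted: nothing else due */

        list_del_init(&svc->replq_elem);
        rt_update_deadline(now, svc);   /* new deadline + full budget */

        /* Re-queue for its next replenishment only if still runnable. */
        if ( vcpu_runnable(svc->vcpu) )
            replq_insert(&prv->replq, svc);

        /* runq/depletedq re-insertion and tickling would happen here. */
    }

    /* Re-arm for the new head of the list, if any. */
    if ( !list_empty(&prv->replq) )
        set_timer(&prv->repl_timer,
                  list_entry(prv->replq.next, struct rt_vcpu,
                             replq_elem)->cur_deadline);

    spin_unlock_irq(&prv->lock);
}
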
>>
>>>> This discussion was before I figured out things about idle_vcpu[]
>>>> and
>>>> tasklets. A vcpu could be preempted and put back to either the runq or
>>>> the depletedq
>>>> if a tasklet is scheduled. It could also end up in a depletedq in
>>>> other
>>>> situations. I guess Meng's point is this vcpu should be running
>>>> constantly
>>>> without being taken off if there is no tasklet, in an effort to
>>>> follow EDF.
>>> Hi Tianyang,
>>>
>>> I think Dario made a good point. In order to avoid vcpu being taken
>>> off from the core, we can handle it in the rt_schedule(), the budget
>>> enforcement function. This could probably make the code cleaner.
>>>
>> Exactly. Budget enforcement is perfectly fine being done in
>> rt_schedule().
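
For my own notes, I read "budget enforcement in rt_schedule()" as
something along the lines of the sketch below. It is simplified and
illustrative only (tasklet handling, tracing and some runq bookkeeping
are left out), not the actual v3 code.

static struct task_slice
rt_schedule(const struct scheduler *ops, s_time_t now,
            bool_t tasklet_work_scheduled)
{
    const unsigned int cpu = smp_processor_id();
    struct rt_vcpu *const scurr = rt_vcpu(current);
    struct rt_vcpu *snext;
    struct task_slice ret = { .migrated = 0 };

    /* Charge the vcpu that just ran for the time it actually consumed. */
    burn_budget(ops, scurr, now);

    /* Pick the earliest-deadline runnable vcpu that still has budget. */
    snext = __runq_pick(ops, cpumask_of(cpu));
    if ( snext == NULL )
        snext = rt_vcpu(idle_vcpu[cpu]);

    /* Keep scurr running if it is still the best (EDF) choice. */
    if ( !is_idle_vcpu(current) && vcpu_runnable(current) &&
         scurr->cur_budget > 0 &&
         scurr->cur_deadline <= snext->cur_deadline )
        snext = scurr;

    if ( snext != scurr && !is_idle_vcpu(snext->vcpu) )
        __q_remove(snext);

    /*
     * Invoke the scheduler again when snext runs out of budget (rather
     * than after a fixed MAX_SCHEDULE quantum); replenishments are driven
     * by the dedicated timer, not by rt_schedule().
     */
    ret.time = is_idle_vcpu(snext->vcpu) ? -1 : snext->cur_budget;
    ret.task = snext->vcpu;

    return ret;
}
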
>>
>>>>>>
>>>>>> +
>>>>>> +    /* if timer was triggered but none of the vcpus
>>>>>> +     * need to be refilled, set the timer to be the
>>>>>> +     * default period + now
>>>>>> +     */
>>>>>> +    if(min_repl == LONG_MAX)
>>>>>> +    {
>>>>>> +        set_timer(&repl_timer, now + RTDS_DEFAULT_PERIOD);
>>>>>>
>>>>> I agree with Meng's point in this thread: this should not be
>>>>> necessary.
>>>>> If it is, it's most likely because of a bug or to something else.
>>>>>
>>>>> Let's figure out what it is, and fix it properly. (I see that in
>>>>> v3
>>>>> this logic is gone, so hopefully you found and fixed the issue
>>>>> already.)
>>>>>
>>>> Yeah. Like I said, the timer is originally programmed to fire when
>>>> the first
>>>> vcpu is inserted, but no vcpus are runnable at the beginning of the
>>>> boot
>>>> process. If the timer is triggered before any vcpu wakes up, there
>>>> is
>>>> nothing on the queue at all. This should be fixed with wake() in V3.
>>>
>>> Great! I'm wondering if we should look at the v3 version, which
>>> should
>>> have fixed many issues? We can decide if the runningq is needed in
>>> v3.
>>>
>> The fact that v3 added runningq is the reason why I'm commenting on
>> this version: I don't think it's necessary at all, and it was easier to
>> focus on other issues with that out of the way.
>>
>> What I've seen in v3 seems ok to me. I can take another look, but I
>> guess the best thing to do would be to put together a v4 with the fixes
>> to the issues v2 had taken from v3, the runningq taken away, and the
>> replenishment queue implemented as I described (if you agree with it).
>
> I see the point. Tianyang, what do you think? Maybe we can just put
> together a v4 quickly and let Dario comment on v4?
>
>>
>>> Dario, What do you think? Right now, I'm kind of lost how we should
>>> proceed next step? Should we modify based on this version or the
>>> latest posted version v3?
>>>
>> As you wish, but I'd base the new version on this version, and
>> "backport" good stuff from v3 to here (again, as it doesn't have the
>> runningq in the way).
>
> I see. Tianyang, let's do what Dario suggests. Actually, the code
> changes from v2 to v3 will be useful knowledge when we are working
> on v4. What do you think? Or we can talk in person today. :-)
>
> Thank you very much for your time on this, Dario!
Yeah, definitely. I am still trying to figure out why there is an 
assertion failure in free_pdata() when I remove a pcpu from a pool... If 
that takes too long I will just send v4 out for discussion first.

Thanks for your time, Dario.

Tianyang

Thread overview: 14+ messages
2015-12-31 10:20 [PATCH V2 0/1] Improved RTDS scheduler Tianyang Chen
2015-12-31 10:20 ` [PATCH V2 1/1] " Tianyang Chen
2015-12-31 13:44   ` Meng Xu
2016-01-06  7:57     ` Tianyang Chen
2016-01-15 15:33       ` Meng Xu
2016-01-19 16:35         ` Tianyang Chen
2016-01-19 16:53           ` Meng Xu
2016-01-25  9:00   ` Dario Faggioli
2016-01-25 22:04     ` Tianyang Chen
2016-01-25 23:00       ` Meng Xu
2016-01-26 14:06         ` Dario Faggioli
2016-01-26 17:28           ` Meng Xu
2016-01-29 22:40             ` Tianyang Chen [this message]
2016-01-26 11:52       ` Dario Faggioli
