From: "Stephan Bärwolf" <stephan.baerwolf@tu-ilmenau.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Subject: Re: sched: fix/optimise some issues
Date: Thu, 21 Jul 2011 18:36:10 +0200
Message-ID: <4E28557A.7040704@tu-ilmenau.de>
In-Reply-To: <1311260895.29152.153.camel@twins>
Thank you for your fast response and your detailed comments.
On 07/21/11 17:08, Peter Zijlstra wrote:
> On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
>> I also implemented 128-bit vruntime support:
>> Mainly on systems with many tasks and (for example) deep cgroups
>> (or increased NICE0_LOAD/SCHED_LOAD_SCALE as in commit
>> c8b281161dfa4bb5d5be63fb036ce19347b88c63), a weighted timeslice
>> (unsigned long) can become very large (on x86_64) and consumes a
>> large part of the u64 vruntimes (per tick) when added.
>> This might lead to mis-scheduling because of overflows.
> Right, so I've often wanted a [us]128 type, and gcc has some (broken?)
> support for that, but overhead has always kept me from it.
128-bit sched_vruntime_t support seems to run fine when compiled with
gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5.
Of course overhead is a problem (but there is also overhead when using
u64 on 32-bit x86), which is why it should be Kconfig-selectable (for
servers with many processes, deep cgroups and many different priorities?).
But I think abstracting the whole vruntime handling into a separate
collection of helpers also simplifies further evaluations and adaptations.
(Think of central statistics collection, for example the maximum timeslice
seen or the number of overflows that happened, without changing all the
lines of code and risking missing something.)
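A minimal userspace sketch of that kind of abstraction, assuming GCC: the vruntime type is picked at compile time (the macro SCHED_VRUNTIME_128 stands in for a hypothetical Kconfig option), and every addition goes through one helper, so statistics such as the largest timeslice seen and the number of wrap-arounds are collected in one place. Only sched_vruntime_t comes from this thread; vruntime_add(), vruntime_before() and struct vruntime_stats are hypothetical names. The comparison uses the same signed-difference idiom the kernel uses for u64 vruntimes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Kconfig switch: define SCHED_VRUNTIME_128 for 128-bit vruntimes. */
#ifdef SCHED_VRUNTIME_128
typedef unsigned __int128 sched_vruntime_t;   /* GCC extension, 64-bit targets */
#else
typedef uint64_t sched_vruntime_t;
#endif

/* Hypothetical central statistics, updated only by vruntime_add(). */
struct vruntime_stats {
	uint64_t max_delta;   /* largest weighted timeslice added so far */
	uint64_t overflows;   /* additions that wrapped around */
};

static struct vruntime_stats vr_stats;

/* Add a weighted timeslice to a vruntime and record statistics. */
static sched_vruntime_t vruntime_add(sched_vruntime_t v, uint64_t delta)
{
	sched_vruntime_t sum = v + delta;

	if (delta > vr_stats.max_delta)
		vr_stats.max_delta = delta;
	if (sum < v)                 /* unsigned wrap-around happened */
		vr_stats.overflows++;
	return sum;
}

/* Wrap-safe "a is before b", in the style of (s64)(a - b) < 0. */
static int vruntime_before(sched_vruntime_t a, sched_vruntime_t b)
{
#ifdef SCHED_VRUNTIME_128
	return (__int128)(a - b) < 0;
#else
	return (int64_t)(a - b) < 0;
#endif
}
```

With the 64-bit default, adding 1 to UINT64_MAX wraps to 0 and is counted in vr_stats.overflows, while vruntime_before() still orders the two values correctly as long as they are less than 2**63 apart.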
> There's also the non-atomicy thing to consider, see min_vruntime_copy
> etc.
I think atomicity is not a (big) issue, for two reasons:
a) on 32-bit x86 a u64 wouldn't be atomic either (vruntime is a u64,
   not an atomic64_t);
b) every operation on cfs_rq->min_vruntime should happen while
   holding the runqueue lock(?).
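For reference, the min_vruntime_copy scheme Peter mentions exists so that a reader which does not hold the runqueue lock can detect a torn 64-bit read on 32-bit machines: the writer stores the value, issues a write barrier, then stores a copy; the reader loads the copy, issues a read barrier, loads the value, and retries until both agree. The userspace model below is a sketch of that pattern, not the kernel code: struct cfs_rq_model, publish_min_vruntime() and read_min_vruntime() are hypothetical, and GCC's __atomic_thread_fence() stands in for smp_wmb()/smp_rmb(). A 128-bit vruntime would presumably need a similar retry loop even on x86_64, since plain 128-bit loads and stores are not guaranteed to be atomic there.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of a cfs_rq: only the two fields of interest. */
struct cfs_rq_model {
	volatile uint64_t min_vruntime;
	volatile uint64_t min_vruntime_copy;
};

/* Writer side: assumed to run with the runqueue lock held. */
static void publish_min_vruntime(struct cfs_rq_model *rq, uint64_t v)
{
	rq->min_vruntime = v;
	__atomic_thread_fence(__ATOMIC_RELEASE);   /* stands in for smp_wmb() */
	rq->min_vruntime_copy = v;
}

/* Lockless reader: retry until value and copy agree, so a torn read is never returned. */
static uint64_t read_min_vruntime(const struct cfs_rq_model *rq)
{
	uint64_t copy, val;

	do {
		copy = rq->min_vruntime_copy;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);   /* stands in for smp_rmb() */
		val = rq->min_vruntime;
	} while (val != copy);
	return val;
}
```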
> How horrid is the current vruntime situation?
This is a point which needs further discussion/observation.
When, for example, NICE0_LOAD is increased by 6 bits (and I think
"c8b281161dfa4bb5d5be63fb036ce19347b88c63" did it by 10 bits
on x86_64), the maximum timeslice on a PRE kernel will be around 2**38
(I am not quite sure whether that was with HZ=1000).
Adding this every ms (let's say 1024 times per second) to min_vruntime
might cause overflows too fast: after 2**(63-38-10) s = 2**15 s ~ 9 h.
A great heterogeneity of priorities may intensify this situation...
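The estimate above can be reproduced under the message's own assumptions: 63 usable bits (vruntimes are compared through signed 64-bit differences), a per-tick increment of about 2**38 and about 2**10 ticks per second. The helper name seconds_to_overflow() is hypothetical and only does this arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds until 63 bits of vruntime are used up when 2**delta_bits is
 * added 2**ticks_bits times per second.  Illustrative arithmetic only.
 */
static uint64_t seconds_to_overflow(unsigned delta_bits, unsigned ticks_bits)
{
	assert(delta_bits + ticks_bits < 63);
	return UINT64_C(1) << (63 - delta_bits - ticks_bits);
}
```

seconds_to_overflow(38, 10) gives 2**15 = 32768 s, a little over 9 hours.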
Long story short: on x86_64 an unsigned long (timeslice) can be
as large as the whole u64 used for min_vruntime, and this is dangerous.
Of course, limiting the maximum timeslice in calc_delta_mine() would
help too, but without the comfort of using the full x86_64 capabilities
(and therefore, mostly, finer priority resolution).
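A sketch of the clamping alternative, assuming GCC's unsigned __int128 for the intermediate product. In the fair class the vruntime increment is roughly delta_exec * NICE_0_LOAD / se->load.weight, so a small weight or an up-scaled NICE_0_LOAD inflates it. weighted_delta_clamped() is a simplified, hypothetical stand-in for calc_delta_mine() (no inverse-weight tricks), and MAX_VRUNTIME_DELTA is a hypothetical cap, not a kernel constant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cap on a single vruntime increment (not a kernel constant). */
#define MAX_VRUNTIME_DELTA (UINT64_C(1) << 32)

/*
 * Simplified stand-in for calc_delta_mine(): scale delta_exec by
 * weight / lw_weight with a 128-bit intermediate, then clamp the result.
 */
static uint64_t weighted_delta_clamped(uint64_t delta_exec, uint64_t weight,
				       uint64_t lw_weight)
{
	unsigned __int128 tmp;

	assert(lw_weight != 0);
	tmp = (unsigned __int128)delta_exec * weight / lw_weight;
	if (tmp > MAX_VRUNTIME_DELTA)
		tmp = MAX_VRUNTIME_DELTA;
	return (uint64_t)tmp;
}
```

The 128-bit intermediate keeps the multiplication itself from overflowing; the clamp then bounds how fast any single entity can advance its vruntime, at the cost of flattening the weight ratio for very light entities.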
> As to your true-idle, there's a very good reason the current SCHED_IDLE
> isn't a true idle scheduler; it would create horrid priority inversion
> problems, imagine the true idle task holding a mutex or is required to
> complete something.
Of course, I fully agree! This is one reason why it was marked as
"experimental". With a few background jobs (for example BOINC
or a bitcoin cruncher ;-) ) it works OK, because there seems to be
little locking that spans processes.
But in general it is a bad idea...
I also vaguely remember that Linus had something against "priority
inheritance" (don't ask me what or why; I don't know), but it would be
an honour for me to work with you guys on implementing this feature in
future kernels. (Based on rb-trees storing the priorities of each "se"
holding the lock, to solve priority inversion? Or, in non-schedulable
contexts, maybe setting a "super-priority" while locking.)
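The inversion problem and the inheritance idea can be shown with a toy model: a strict-priority picker (a true idle task never runs while anything else is runnable), and an effective priority for a lock holder equal to the maximum of its own priority and its waiters' priorities. effective_prio() and pick_next() are hypothetical; here a higher number means higher priority (the opposite of the kernel's prio convention), and the linear scan stands in for the sorted structure, such as an rb-tree, suggested above.

```c
#include <assert.h>
#include <stddef.h>

/* Effective priority of a lock holder: max of its own and all waiters' priorities. */
static int effective_prio(int own_prio, const int *waiter_prios, size_t nr_waiters)
{
	int prio = own_prio;

	for (size_t i = 0; i < nr_waiters; i++)
		if (waiter_prios[i] > prio)
			prio = waiter_prios[i];
	return prio;
}

/* Strict-priority pick: index of the highest-priority runnable task. */
static size_t pick_next(const int *prios, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (prios[i] > prios[best])
			best = i;
	return best;
}
```

Example scenario: a true-idle task (priority 0) holds a mutex that a priority-2 task waits for, while a priority-1 CPU hog is runnable. Without inheritance the hog is always picked, so the holder and therefore the waiter starve; with inheritance the holder runs at priority 2 until it releases the lock.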
I think real idle scheduling (maybe with more than one idle level)
would be a great feature for future kernels (for example, to utilize
expensive systems without noticeable effects on interactivity),
especially because SMP is gaining more and more importance (with
increasing numbers of CPUs/cores), and load balancing often leads to
short but significant idle phases on lightly loaded (because
interactive) systems.
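One possible shape for "more than one idle level", sketched as a data structure: per-level runnable counts plus a bitmap of non-empty levels, so the next level to run is found with a find-first-set, similar in spirit to the bitmap the RT class keeps over its priority queues. NR_IDLE_LEVELS, struct idle_levels and the three helpers are hypothetical, and the sketch assumes this pick is only consulted when no normal task is runnable.

```c
#include <assert.h>
#include <stdint.h>

#define NR_IDLE_LEVELS 4   /* hypothetical number of idle levels */

/* Runnable tasks per idle level; bit n set = level n non-empty (0 = most important). */
struct idle_levels {
	unsigned int nr_running[NR_IDLE_LEVELS];
	uint32_t bitmap;
};

static void idle_enqueue(struct idle_levels *il, unsigned level)
{
	assert(level < NR_IDLE_LEVELS);
	il->nr_running[level]++;
	il->bitmap |= UINT32_C(1) << level;
}

static void idle_dequeue(struct idle_levels *il, unsigned level)
{
	assert(level < NR_IDLE_LEVELS && il->nr_running[level] > 0);
	if (--il->nr_running[level] == 0)
		il->bitmap &= ~(UINT32_C(1) << level);
}

/* Level to run next, or -1 if no idle-class task is runnable. */
static int idle_pick_level(const struct idle_levels *il)
{
	return il->bitmap ? __builtin_ctz(il->bitmap) : -1;
}
```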
Thanks,
regards Stephan
--
Dipl.-Inf. Stephan Bärwolf
Ilmenau University of Technology, Integrated Communication Systems Group
Phone: +49 (0)3677 69 4130
Email: stephan.baerwolf@tu-ilmenau.de,
Web: http://www.tu-ilmenau.de/iks