Linux Documentation
 help / color / mirror / Atom feed
* [tip:timers/core] hrtimer: Use a bullet for the returns bullet list
From: tip-bot for Mauro Carvalho Chehab @ 2019-06-27 21:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mchehab+samsung, hpa, linux-kernel, tglx, mingo, mchehab, corbet,
	linux-doc
In-Reply-To: <74ddad7dac331b4e5ce4a90e15c8a49e3a16d2ac.1561372382.git.mchehab+samsung@kernel.org>

Commit-ID:  516337048fa40496ae5ca9863c367ec991a44d9a
Gitweb:     https://git.kernel.org/tip/516337048fa40496ae5ca9863c367ec991a44d9a
Author:     Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
AuthorDate: Mon, 24 Jun 2019 07:33:26 -0300
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 27 Jun 2019 23:30:04 +0200

hrtimer: Use a bullet for the returns bullet list

That gets rid of this warning:

   ./kernel/time/hrtimer.c:1119: WARNING: Block quote ends without a blank line; unexpected unindent.

and displays nicely both at the source code and at the produced
documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://lkml.kernel.org/r/74ddad7dac331b4e5ce4a90e15c8a49e3a16d2ac.1561372382.git.mchehab+samsung@kernel.org

---
 kernel/time/hrtimer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index edb230aba3d1..5ee77f1a8a92 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1114,9 +1114,10 @@ EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
  * @timer:	hrtimer to stop
  *
  * Returns:
- *  0 when the timer was not active
- *  1 when the timer was active
- * -1 when the timer is currently executing the callback function and
+ *
+ *  *  0 when the timer was not active
+ *  *  1 when the timer was active
+ *  * -1 when the timer is currently executing the callback function and
  *    cannot be stopped
  */
 int hrtimer_try_to_cancel(struct hrtimer *timer)

^ permalink raw reply related

* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Waiman Long @ 2019-06-27 21:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190627212419.GA25233@tower.DHCP.thefacebook.com>

On 6/27/19 5:24 PM, Roman Gushchin wrote:
>>> 2) what's your long-term vision here? do you think that we need to shrink
>>>    kmem_caches periodically, depending on memory pressure? how a user
>>>    will use this new sysctl?
>> Shrinking the kmem caches under extreme memory pressure can be one way
>> to free up extra pages, but the effect will probably be temporary.
>>> What's the problem you're trying to solve in general?
>> At least for the slub allocator, shrinking the caches allow the number
>> of active objects reported in slabinfo to be more accurate. In addition,
>> this allow to know the real slab memory consumption. I have been working
>> on a BZ about continuous memory leaks with a container based workloads.
>> The ability to shrink caches allow us to get a more accurate memory
>> consumption picture. Another alternative is to turn on slub_debug which
>> will then disables all the per-cpu slabs.
> I see... I agree with Michal here, that extending drop_caches sysctl isn't
> the best idea. Isn't it possible to achieve the same effect using slub sysfs?

Yes, using the slub sysfs interface can be a possible alternative.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Luis Chamberlain @ 2019-06-27 21:25 UTC (permalink / raw)
  To: Waiman Long, Masami Hiramatsu, Masoud Asgharifard Sharbiani
  Cc: Roman Gushchin, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Alexander Viro, Jonathan Corbet,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <063752b2-4f1a-d198-36e7-3e642d4fcf19@redhat.com>

On Thu, Jun 27, 2019 at 04:57:50PM -0400, Waiman Long wrote:
> On 6/26/19 4:19 PM, Roman Gushchin wrote:
> >>  
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> >> +				    void __maybe_unused *arg)
> >> +{
> >> +	struct kmem_cache *s;
> >> +
> >> +	if (memcg == root_mem_cgroup)
> >> +		return;
> >> +	mutex_lock(&slab_mutex);
> >> +	list_for_each_entry(s, &memcg->kmem_caches,
> >> +			    memcg_params.kmem_caches_node) {
> >> +		kmem_cache_shrink(s);
> >> +	}
> >> +	mutex_unlock(&slab_mutex);
> >> +	cond_resched();
> >> +}
> > A couple of questions:
> > 1) how about skipping already offlined kmem_caches? They are already shrunk,
> >    so you probably won't get much out of them. Or isn't it true?
> 
> I have been thinking about that. This patch is based on the linux tree
> and so don't have an easy to find out if the kmem caches have been
> shrinked. Rebasing this on top of linux-next, I can use the
> SLAB_DEACTIVATED flag as a marker for skipping the shrink.
> 
> With all the latest patches, I am still seeing 121 out of a total of 726
> memcg kmem caches (1/6) that are deactivated caches after system bootup
> one of the test systems. My system is still using cgroup v1 and so the
> number may be different in a v2 setup. The next step is probably to
> figure out why those deactivated caches are still there.
> 
> > 2) what's your long-term vision here? do you think that we need to shrink
> >    kmem_caches periodically, depending on memory pressure? how a user
> >    will use this new sysctl?
> Shrinking the kmem caches under extreme memory pressure can be one way
> to free up extra pages, but the effect will probably be temporary.
> > What's the problem you're trying to solve in general?
> 
> At least for the slub allocator, shrinking the caches allow the number
> of active objects reported in slabinfo to be more accurate. In addition,
> this allow to know the real slab memory consumption. I have been working
> on a BZ about continuous memory leaks with a container based workloads.

So.. this is still a work around?

> The ability to shrink caches allow us to get a more accurate memory
> consumption picture. Another alternative is to turn on slub_debug which
> will then disables all the per-cpu slabs.

So this is a debugging mechanism?

> Anyway, I think this can be useful to others that is why I posted the patch.

Since this is debug stuff, please add this to /proc/sys/debug/ instead.
That would reflect the intention, and would avoid the concern that folks
in production would use these things.

Since we only have 2 users of /proc/sys/debug/ I am now wondering if
would be best to add a new sysctl debug taint flag. This way bug
reports with these stupid knobs can got to /dev/null inbox for bug
reports.

Masami, /proc/sys/debug/kprobes-optimization is debug. Would you be OK
to add the taint for it too?

Masoud, /proc/sys/debug/exception-trace seems to actually be enabled
by default, and its goal seems to be to enable disabling it. So I
don't think it would make sense to taint there.

So.. maybe we need something /proc/sys/taints/ or
/proc/sys/debug/taints/ so its *very* clear this is by no way ever
expected to be used in production.

May even be good to long term add a symlink for vm/drop_caches there
as well?

  Luis

^ permalink raw reply

* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Roman Gushchin @ 2019-06-27 21:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <063752b2-4f1a-d198-36e7-3e642d4fcf19@redhat.com>

On Thu, Jun 27, 2019 at 04:57:50PM -0400, Waiman Long wrote:
> On 6/26/19 4:19 PM, Roman Gushchin wrote:
> >>  
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> >> +				    void __maybe_unused *arg)
> >> +{
> >> +	struct kmem_cache *s;
> >> +
> >> +	if (memcg == root_mem_cgroup)
> >> +		return;
> >> +	mutex_lock(&slab_mutex);
> >> +	list_for_each_entry(s, &memcg->kmem_caches,
> >> +			    memcg_params.kmem_caches_node) {
> >> +		kmem_cache_shrink(s);
> >> +	}
> >> +	mutex_unlock(&slab_mutex);
> >> +	cond_resched();
> >> +}
> > A couple of questions:
> > 1) how about skipping already offlined kmem_caches? They are already shrunk,
> >    so you probably won't get much out of them. Or isn't it true?
> 
> I have been thinking about that. This patch is based on the linux tree
> and so don't have an easy to find out if the kmem caches have been
> shrinked. Rebasing this on top of linux-next, I can use the
> SLAB_DEACTIVATED flag as a marker for skipping the shrink.
> 
> With all the latest patches, I am still seeing 121 out of a total of 726
> memcg kmem caches (1/6) that are deactivated caches after system bootup
> one of the test systems. My system is still using cgroup v1 and so the
> number may be different in a v2 setup. The next step is probably to
> figure out why those deactivated caches are still there.

It's not a secret: these kmem_caches are holding objects, which are in use.
It's a drawback of the current slab accounting implementation: every
object holds a whole page and the corresponding kmem_cache. It's optimized
for a large number of objects, which are created and destroyed within
the life of the cgroup (e.g. task_structs), and it works worse for long-living
objects like vfs cache.

Long-term I think we need a different implementation for long-living objects,
so that objects belonging to different memory cgroups can share the same page
and kmem_caches.

It's a fairly big change though.

> 
> > 2) what's your long-term vision here? do you think that we need to shrink
> >    kmem_caches periodically, depending on memory pressure? how a user
> >    will use this new sysctl?
> Shrinking the kmem caches under extreme memory pressure can be one way
> to free up extra pages, but the effect will probably be temporary.
> > What's the problem you're trying to solve in general?
> 
> At least for the slub allocator, shrinking the caches allow the number
> of active objects reported in slabinfo to be more accurate. In addition,
> this allow to know the real slab memory consumption. I have been working
> on a BZ about continuous memory leaks with a container based workloads.
> The ability to shrink caches allow us to get a more accurate memory
> consumption picture. Another alternative is to turn on slub_debug which
> will then disables all the per-cpu slabs.

I see... I agree with Michal here, that extending drop_caches sysctl isn't
the best idea. Isn't it possible to achieve the same effect using slub sysfs?

Thanks!

^ permalink raw reply

* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Waiman Long @ 2019-06-27 21:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm, linux-doc,
	linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190627151506.GE5303@dhcp22.suse.cz>

On 6/27/19 11:15 AM, Michal Hocko wrote:
> On Mon 24-06-19 13:42:19, Waiman Long wrote:
>> With the slub memory allocator, the numbers of active slab objects
>> reported in /proc/slabinfo are not real because they include objects
>> that are held by the per-cpu slab structures whether they are actually
>> used or not.  The problem gets worse the more CPUs a system have. For
>> instance, looking at the reported number of active task_struct objects,
>> one will wonder where all the missing tasks gone.
>>
>> I know it is hard and costly to get a real count of active objects.
> What exactly is expensive? Why cannot slabinfo reduce the number of
> active objects by per-cpu cached objects?
>
The number of cachelines that needs to be accessed in order to get an
accurate count will be much higher if we need to iterate through all the
per-cpu structures. In addition, accessing the per-cpu partial list will
be racy.


>> So
>> I am not advocating for that. Instead, this patch extends the
>> /proc/sys/vm/drop_caches sysctl parameter by using a new bit (bit 3)
>> to shrink all the kmem slabs which will flush out all the slabs in the
>> per-cpu structures and give a more accurate view of how much memory are
>> really used up by the active slab objects. This is a costly operation,
>> of course, but it gives a way to have a clearer picture of the actual
>> number of slab objects used, if the need arises.
> drop_caches is a terrible interface. It destroys all the caching and
> people are just too easy in using it to solve any kind of problem they
> think they might have and cause others they might not see immediately.
> I am strongly discouraging anybody - except for some tests which really
> do want to see reproducible results without cache effects - from using
> this interface and therefore I am not really happy to paper over
> something that might be a real problem with yet another mode. If SLUB
> indeed caches too aggressively on large machines then this should be
> fixed.
>
OK, as explained in another thread, the main reason for doing this patch
is to be able to do more accurate measurement of changes in kmem cache
memory consumption. Yes, I do agree that drop_caches is not a general
purpose interface that should be used lightly.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH 1/2] mm, memcontrol: Add memcg_iterate_all()
From: Waiman Long @ 2019-06-27 21:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm, linux-doc,
	linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190627150746.GD5303@dhcp22.suse.cz>

On 6/27/19 11:07 AM, Michal Hocko wrote:
> On Mon 24-06-19 13:42:18, Waiman Long wrote:
>> Add a memcg_iterate_all() function for iterating all the available
>> memory cgroups and call the given callback function for each of the
>> memory cgruops.
> Why is a trivial wrapper any better than open coded usage of the
> iterator?

Because the iterator is only defined within memcontrol.c. So an
alternative may be to put the iterator into a header file that can be
used by others. Will take a look at that.

Cheers,
Longman


^ permalink raw reply

* [Linux-kernel-mentees][PATCH] doc: RCU callback locks need only _bh, not necessarily _irq
From: Jiunn Chang @ 2019-06-27 21:01 UTC (permalink / raw)
  To: skhan
  Cc: linux-kernel-mentees, paulmck, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, joel, corbet, rcu, linux-doc, linux-kernel

The UP.rst file calls for locks acquired within RCU callback functions
to use _irq variants (spin_lock_irqsave() or similar), which does work,
but can be overkill.  This commit therefore instead calls for _bh variants
(spin_lock_bh() or similar), while noting that _irq does work.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
---
 Documentation/RCU/UP.rst | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/Documentation/RCU/UP.rst b/Documentation/RCU/UP.rst
index 67715a47ae89..e26dda27430c 100644
--- a/Documentation/RCU/UP.rst
+++ b/Documentation/RCU/UP.rst
@@ -113,12 +113,13 @@ Answer to Quick Quiz #1:
 Answer to Quick Quiz #2:
 	What locking restriction must RCU callbacks respect?
 
-	Any lock that is acquired within an RCU callback must be
-	acquired elsewhere using an _irq variant of the spinlock
-	primitive.  For example, if "mylock" is acquired by an
-	RCU callback, then a process-context acquisition of this
-	lock must use something like spin_lock_irqsave() to
-	acquire the lock.
+	Any lock that is acquired within an RCU callback must be acquired
+	elsewhere using an _bh variant of the spinlock primitive.
+	For example, if "mylock" is acquired by an RCU callback, then
+	a process-context acquisition of this lock must use something
+	like spin_lock_bh() to acquire the lock.  Please note that
+	it is also OK to use _irq variants of spinlocks, for example,
+	spin_lock_irqsave().
 
 	If the process-context code were to simply use spin_lock(),
 	then, since RCU callbacks can be invoked from softirq context,
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Waiman Long @ 2019-06-27 20:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190626201900.GC24698@tower.DHCP.thefacebook.com>

On 6/26/19 4:19 PM, Roman Gushchin wrote:
>>  
>> +#ifdef CONFIG_MEMCG_KMEM
>> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
>> +				    void __maybe_unused *arg)
>> +{
>> +	struct kmem_cache *s;
>> +
>> +	if (memcg == root_mem_cgroup)
>> +		return;
>> +	mutex_lock(&slab_mutex);
>> +	list_for_each_entry(s, &memcg->kmem_caches,
>> +			    memcg_params.kmem_caches_node) {
>> +		kmem_cache_shrink(s);
>> +	}
>> +	mutex_unlock(&slab_mutex);
>> +	cond_resched();
>> +}
> A couple of questions:
> 1) how about skipping already offlined kmem_caches? They are already shrunk,
>    so you probably won't get much out of them. Or isn't it true?

I have been thinking about that. This patch is based on the linux tree
and so don't have an easy to find out if the kmem caches have been
shrinked. Rebasing this on top of linux-next, I can use the
SLAB_DEACTIVATED flag as a marker for skipping the shrink.

With all the latest patches, I am still seeing 121 out of a total of 726
memcg kmem caches (1/6) that are deactivated caches after system bootup
one of the test systems. My system is still using cgroup v1 and so the
number may be different in a v2 setup. The next step is probably to
figure out why those deactivated caches are still there.

> 2) what's your long-term vision here? do you think that we need to shrink
>    kmem_caches periodically, depending on memory pressure? how a user
>    will use this new sysctl?
Shrinking the kmem caches under extreme memory pressure can be one way
to free up extra pages, but the effect will probably be temporary.
> What's the problem you're trying to solve in general?

At least for the slub allocator, shrinking the caches allow the number
of active objects reported in slabinfo to be more accurate. In addition,
this allow to know the real slab memory consumption. I have been working
on a BZ about continuous memory leaks with a container based workloads.
The ability to shrink caches allow us to get a more accurate memory
consumption picture. Another alternative is to turn on slub_debug which
will then disables all the per-cpu slabs.

Anyway, I think this can be useful to others that is why I posted the patch.

Cheers,
Longman


^ permalink raw reply

* [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Dave Chiluk @ 2019-06-27 19:49 UTC (permalink / raw)
  To: Ben Segall, Pqhil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc
In-Reply-To: <1561664970-1555-1-git-send-email-chiluk+linux@indeed.com>

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevents the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 71 +++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -33,12 +34,12 @@ The default values are:
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+However in a worst-case scenario, highly-threaded, interactive/non-cpu bound
+applications this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on (typically at most 1ms per cpu). This slight burst
+only applies if quota had been assigned to a cpu and then not fully used or
+returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to quota average usage, albeit over a longer time window than period.
+This also limits the burst ability to no more than 1ms per cpu. This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wasting quota on cpu-local silos that don't need a full slice's
+amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH v5 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Dave Chiluk @ 2019-06-27 19:49 UTC (permalink / raw)
  To: Ben Segall, Pqhil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc
In-Reply-To: <1558121424-2914-1-git-send-email-chiluk+linux@indeed.com>

Changelog v5
- Based on this comment from Ben Segall's comment on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt give my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

^ permalink raw reply

* Re: [PATCH] drm: fix a reference for a renamed file: fb/modedb.rst
From: Daniel Vetter @ 2019-06-27 19:43 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Linux Doc Mailing List,
	Mauro Carvalho Chehab, Linux Kernel Mailing List,
	Maarten Lankhorst, Maxime Ripard, Sean Paul, David Airlie,
	dri-devel
In-Reply-To: <20190627113122.34b46ee2@lwn.net>

On Thu, Jun 27, 2019 at 7:31 PM Jonathan Corbet <corbet@lwn.net> wrote:
>
> On Wed, 26 Jun 2019 23:27:35 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
>
> > On Wed, Jun 26, 2019 at 01:14:13PM -0300, Mauro Carvalho Chehab wrote:
> > > Due to two patches being applied about the same time, the
> > > reference for modedb.rst file got wrong:
> > >
> > >     Documentation/fb/modedb.txt is now Documentation/fb/modedb.rst.
> > >
> > > Fixes: 1bf4e09227c3 ("drm/modes: Allow to specify rotation and reflection on the commandline")
> > > Fixes: ab42b818954c ("docs: fb: convert docs to ReST and rename to *.rst")
> > > Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> >
> > What's the merge plan here? doc-next? If so:
> >
> > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> It doesn't really apply to docs-next, so that's probably not the best
> path unless I hold it until after the merge window.  Seems like it needs
> to go through the DRM tree to me.

fbdev isn't in drm (yet), so I don't think it applies here eithe. I
guess we need to wait until after -rc1 with this one. Topic
branch/merge seems a bit overkill here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply

* [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Dave Chiluk @ 2019-06-27 19:09 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc
In-Reply-To: <1558121424-2914-1-git-send-email-chiluk+linux@indeed.com>

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevents the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 71 +++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -33,12 +34,12 @@ The default values are:
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+However in a worst-case scenario, highly-threaded, interactive/non-cpu bound
+applications this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on (typically at most 1ms per cpu). This slight burst
+only applies if quota had been assigned to a cpu and then not fully used or
+returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to quota average usage, albeit over a longer time window than period.
+This also limits the burst ability to no more than 1ms per cpu. This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wasting quota on cpu-local silos that don't need a full slice's
+amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH v5 01/18] kunit: test: add KUnit test runner core
From: Stephen Boyd @ 2019-06-27 18:16 UTC (permalink / raw)
  To: Brendan Higgins
  Cc: Frank Rowand, Greg KH, Josh Poimboeuf, Kees Cook, Kieran Bingham,
	Luis Chamberlain, Peter Zijlstra, Rob Herring, shuah,
	Theodore Ts'o, Masahiro Yamada, devicetree, dri-devel,
	kunit-dev, open list:DOCUMENTATION, linux-fsdevel, linux-kbuild,
	Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK,
	linux-nvdimm, linux-um, Sasha Levin, Bird, Timothy,
	Amir Goldstein, Dan Carpenter, Daniel Vetter, Jeff Dike,
	Joel Stanley, Julia Lawall, Kevin Hilman, Knut Omang,
	Logan Gunthorpe, Michael Ellerman, Petr Mladek, Randy Dunlap,
	Richard Weinberger, David Rientjes, Steven Rostedt, wfg
In-Reply-To: <CAFd5g46zHAupdUh3wDuqPJti2M+_=oje_5weFe7AVLQfkDDM6A@mail.gmail.com>

Quoting Brendan Higgins (2019-06-26 16:00:40)
> On Tue, Jun 25, 2019 at 8:41 PM Stephen Boyd <sboyd@kernel.org> wrote:
> 
> > scenario like below, but where it is a problem. There could be three
> > CPUs, or even one CPU and three threads if you want to describe the
> > extra thread scenario.
> >
> > Here's my scenario where it isn't needed:
> >
> >     CPU0                                      CPU1
> >     ----                                      ----
> >     kunit_run_test(&test)
> >                                               test_case_func()
> >                                                 ....
> >                                               [mock hardirq]
> >                                                 kunit_set_success(&test)
> >                                               [hardirq ends]
> >                                                 ...
> >                                                 complete(&test_done)
> >       wait_for_completion(&test_done)
> >       kunit_get_success(&test)
> >
> > We don't need to care about having locking here because success or
> > failure only happens in one place and it's synchronized with the
> > completion.
> 
> Here is the scenario I am concerned about:
> 
> CPU0                      CPU1                       CPU2
> ----                      ----                       ----
> kunit_run_test(&test)
>                           test_case_func()
>                             ....
>                             schedule_work(foo_func)
>                           [mock hardirq]             foo_func()
>                             ...                        ...
>                             kunit_set_success(false)   kunit_set_success(false)
>                           [hardirq ends]               ...
>                             ...
>                             complete(&test_done)
>   wait_for_completion(...)
>   kunit_get_success(&test)
> 
> In my scenario, since both CPU1 and CPU2 update the success status of
> the test simultaneously, even though they are setting it to the same
> value. If my understanding is correct, this could result in a
> write-tear on some architectures in some circumstances. I suppose we
> could just make it an atomic boolean, but I figured locking is also
> fine, and generally preferred.

This is what we have WRITE_ONCE() and READ_ONCE() for. Maybe you could
just use that in the getter and setters and remove the lock if it isn't
used for anything else.

It may also be a good idea to have a kunit_fail_test() API that fails
the test passed in with a WRITE_ONCE(false). Otherwise, the test is
assumed successful and it isn't even possible for a test to change the
state from failure to success due to a logical error because the API
isn't available. Then we don't really need to have a generic
kunit_set_success() function at all. We could have a kunit_test_failed()
function too that replaces the kunit_get_success() function. That would
read better in an if condition.

> 
> Also, to be clear, I am onboard with dropping then IRQ stuff for now.
> I am fine moving to a mutex for the time being.
> 

Ok.


^ permalink raw reply

* Re: [PATCH] drm: fix a reference for a renamed file: fb/modedb.rst
From: Jonathan Corbet @ 2019-06-27 17:31 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Mauro Carvalho Chehab, Linux Doc Mailing List,
	Mauro Carvalho Chehab, linux-kernel, Maarten Lankhorst,
	Maxime Ripard, Sean Paul, David Airlie, dri-devel
In-Reply-To: <20190626212735.GY12905@phenom.ffwll.local>

On Wed, 26 Jun 2019 23:27:35 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Jun 26, 2019 at 01:14:13PM -0300, Mauro Carvalho Chehab wrote:
> > Due to two patches being applied about the same time, the
> > reference for modedb.rst file got wrong:
> > 
> > 	Documentation/fb/modedb.txt is now Documentation/fb/modedb.rst.
> > 
> > Fixes: 1bf4e09227c3 ("drm/modes: Allow to specify rotation and reflection on the commandline")
> > Fixes: ab42b818954c ("docs: fb: convert docs to ReST and rename to *.rst")
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>  
> 
> What's the merge plan here? doc-next? If so:
> 
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

It doesn't really apply to docs-next, so that's probably not the best
path unless I hold it until after the merge window.  Seems like it needs
to go through the DRM tree to me.

Thanks,

jon

^ permalink raw reply

* Re: [Linux-kernel-mentees][PATCH v5 1/5] Documentation: RCU: Convert RCU basic concepts to reST
From: Shuah Khan @ 2019-06-27 16:48 UTC (permalink / raw)
  To: Steven Rostedt, Jonathan Corbet
  Cc: Jiunn Chang, linux-kernel-mentees, rcu, linux-doc, paulmck, josh,
	mathieu.desnoyers, jiangshanlai, joel, skh >> Shuah Khan
In-Reply-To: <20190627111324.1db0f1ec@gandalf.local.home>

On 6/27/19 9:13 AM, Steven Rostedt wrote:
> On Thu, 27 Jun 2019 08:34:43 -0600
> Jonathan Corbet <corbet@lwn.net> wrote:
> 
>> On Wed, 26 Jun 2019 15:07:01 -0500
>> Jiunn Chang <c0d1n61at3@gmail.com> wrote:
>>
>>> RCU basic concepts reST markup.
>>>
>>> Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
>>> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>
>> So this is a little detail but ... your signoff should be the last thing
>> in the set of tags on the patch.
> 
> Note, I've been seeing this a lot lately, and then noticed, that when I
> downloaded a patch directly from patchwork, it placed all the
> Reviewed-by and Acked-by tags after the original Signed-off-by. I
> checked the original patch on the mailing list, and it had no other
> tags but the Signed-off-by. I then pulled one of my own patches, and it
> did it to that patch as well.
> 
> I too prefer the Signed-off-by be last, but our tooling needs to do
> this as well, otherwise it's a failure in our procedures.
> 

Thanks Steve for pointing this out. I am seeing some odd behavior with tags.

It appears some maintainers want the tags in chronological order, which
is Reviewed-by after Signed-off which doesn't make sense to me.

I prefer Signed-off-by last.

I am working on FAQ (Frequently Answered Questions) section for mentees.
I will add this to it.

thanks,
-- Shuah



^ permalink raw reply

* Re: [Linux-kernel-mentees][PATCH v5 1/5] Documentation: RCU: Convert RCU basic concepts to reST
From: Jiunn Chang @ 2019-06-27 16:47 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: skhan, linux-kernel-mentees, rcu, linux-doc, paulmck, josh,
	rostedt, mathieu.desnoyers, jiangshanlai, joel
In-Reply-To: <20190627083443.4f4918a7@lwn.net>

On Thu, Jun 27, 2019 at 08:34:43AM -0600, Jonathan Corbet wrote:
> On Wed, 26 Jun 2019 15:07:01 -0500
> Jiunn Chang <c0d1n61at3@gmail.com> wrote:
> 
> > RCU basic concepts reST markup.
> > 
> > Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> So this is a little detail but ... your signoff should be the last thing
> in the set of tags on the patch.
> 
> This isn't worth making you do yet another revision, so I went ahead and
> applied the patches and fixed the tag ordering on the way in.  I'll also
> append a patch adding the new RCU stuff into the core-api manual so people
> can actually get to it.
> 
> Thanks,
> 
> jon

Hello Jon,

I will keep this in mind for next time.  I would like to thank you, Joel, Paul
and everyone else who has helped me learn the Linux kernel patch process.

I will send a patch for the UP systems change Paul sent me for _bh suffix.

THX,

Jiunn

^ permalink raw reply

* Re: [PATCH v2 0/4] Compile-test UAPI and kernel headers
From: Masahiro Yamada @ 2019-06-27 16:42 UTC (permalink / raw)
  To: Jani Nikula
  Cc: Linux Kbuild mailing list, Sam Ravnborg, Tony Luck,
	open list:DOCUMENTATION, John Fastabend, Jonathan Corbet,
	Jakub Kicinski, linux-riscv, Daniel Borkmann, xdp-newbies,
	Anton Vorontsov, Palmer Dabbelt, Matthias Brugger, Song Liu,
	Yonghong Song, Michal Marek, Jesper Dangaard Brouer,
	Martin KaFai Lau, moderated list:ARM/Mediatek SoC support,
	linux-arm-kernel, Albert Ou, Colin Cross, David S. Miller,
	Kees Cook, Alexei Starovoitov, Networking,
	Linux Kernel Mailing List, bpf
In-Reply-To: <87y31np89f.fsf@intel.com>

On Thu, Jun 27, 2019 at 8:36 PM Jani Nikula <jani.nikula@linux.intel.com> wrote:
>
> On Thu, 27 Jun 2019, Masahiro Yamada <yamada.masahiro@socionext.com> wrote:
> > 1/4: reworked v2.
> >
> > 2/4: fix a flaw I noticed when I was working on this series
> >
> > 3/4: maybe useful for 4/4 and in some other places
> >
> > 4/4: v2. compile as many headers as possible.
> >
> >
> > Changes in v2:
> >  - Add CONFIG_CPU_{BIG,LITTLE}_ENDIAN guard to avoid build error
> >  - Use 'header-test-' instead of 'no-header-test'
> >  - Avoid weird 'find' warning when cleaning
> >   - New patch
> >   - New patch
> >   - Add everything to test coverage, and exclude broken ones
> >   - Rename 'Makefile' to 'Kbuild'
> >   - Add CONFIG_KERNEL_HEADER_TEST option
> >
> > Masahiro Yamada (4):
> >   kbuild: compile-test UAPI headers to ensure they are self-contained
> >   kbuild: do not create wrappers for header-test-y
> >   kbuild: support header-test-pattern-y
> >   kbuild: compile-test kernel headers to ensure they are self-contained
>
> [responding here because I didn't receive the actual patch]
>
> This looks like it's doing what it's supposed to, but I ran into a bunch
> of build fails with CONFIG_OF=n. Sent a fix to one [1], but stopped at
> the next. Looks like you'll have to exclude more. And I'm pretty sure
> we'll uncover more configurations where this will fail.

Thanks for testing.

I did more compile-tests, and excluded more headers in v3.

Thanks.



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* [PATCH v3 0/4] Compile-test UAPI and kernel headers
From: Masahiro Yamada @ 2019-06-27 16:38 UTC (permalink / raw)
  To: linux-kbuild
  Cc: Jani Nikula, Sam Ravnborg, Masahiro Yamada, Tony Luck, linux-doc,
	John Fastabend, Jonathan Corbet, Jakub Kicinski, linux-riscv,
	Daniel Borkmann, xdp-newbies, Anton Vorontsov, Palmer Dabbelt,
	Matthias Brugger, Song Liu, Yonghong Song, Michal Marek,
	Jesper Dangaard Brouer, Martin KaFai Lau, linux-mediatek,
	linux-arm-kernel, Albert Ou, Colin Cross, David S. Miller,
	Kees Cook, Alexei Starovoitov, netdev, linux-kernel, bpf

1/4: Compile-test exported headers (reworked in v2)

2/4: fix a flaw I noticed when I was working on this series.
     Avoid generating intermediate wrappers.

3/4: maybe useful for 4/4 and in some other places.
     Add header-test-pattern-y syntax.

4/4: Compile-test kernel-space headers in include/.
     v2: compile as many headers as possible.
     v3: exclude more headers causing build errors


Masahiro Yamada (4):
  kbuild: compile-test UAPI headers to ensure they are self-contained
  kbuild: do not create wrappers for header-test-y
  kbuild: support header-test-pattern-y
  kbuild: compile-test kernel headers to ensure they are self-contained

 .gitignore                         |    1 -
 Documentation/dontdiff             |    1 -
 Documentation/kbuild/makefiles.txt |   13 +-
 Makefile                           |    4 +-
 include/Kbuild                     | 1250 ++++++++++++++++++++++++++++
 init/Kconfig                       |   22 +
 scripts/Makefile.build             |   10 +-
 scripts/Makefile.lib               |   13 +-
 scripts/cc-system-headers.sh       |    8 +
 usr/.gitignore                     |    1 -
 usr/Makefile                       |    2 +
 usr/include/.gitignore             |    3 +
 usr/include/Makefile               |  134 +++
 13 files changed, 1449 insertions(+), 13 deletions(-)
 create mode 100644 include/Kbuild
 create mode 100755 scripts/cc-system-headers.sh
 create mode 100644 usr/include/.gitignore
 create mode 100644 usr/include/Makefile

-- 
2.17.1


^ permalink raw reply

* [PATCH v3 2/4] kbuild: do not create wrappers for header-test-y
From: Masahiro Yamada @ 2019-06-27 16:39 UTC (permalink / raw)
  To: linux-kbuild
  Cc: Jani Nikula, Sam Ravnborg, Masahiro Yamada, linux-doc,
	Jonathan Corbet, linux-kernel, Michal Marek
In-Reply-To: <20190627163903.28398-1-yamada.masahiro@socionext.com>

header-test-y does not work with headers in sub-directories.

For example, you may want to write a Makefile, like this:

include/linux/Kbuild:

  header-test-y += mtd/nand.h

This entry will create a wrapper include/linux/mtd/nand.hdrtest.c
with the following content:

  #include "mtd/nand.h"

To make this work, we need to add $(srctree)/include/linux to the
header search path. It would be tedious to add ccflags-y.

Instead, we could change the *.hdrtest.c rule to wrap:

  #include "nand.h"

This works for in-tree build since #include "..." searches in the
relative path from the header with this directive. For O=... build,
we need to add $(srctree)/include/linux/mtd to the header search path,
which will be even more tedious.

After all, I thought it would be handier to compile headers directly
without creating wrappers.

I added a new build rule to compile %.h into %.h.s

The target is %.h.s instead of %.h.o because it is slightly faster.
Also, as for GCC, an empty assembly is smaller than an empty object.

I wrote the build rule:

  $(CC) $(c_flags) -S -o $@ -x c /dev/null -include $<

instead of:

  $(CC) $(c_flags) -S -o $@ -x c $<

Both work fine with GCC, but the latter is bad for Clang.

This comes down to the difference in the -Wunused-function policy.
GCC does not warn about unused 'static inline' functions at all.
Clang does not warn about the ones in included headers, but does
about the ones in the source. So, we should handle headers as
headers, not as source files.

In fact, this has been hidden since commit abb2ea7dfd82 ("compiler,
clang: suppress warning for unused static inline functions"), but we
should not rely on that.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Tested-by: Jani Nikula <jani.nikula@intel.com>
---

Changes in v3: None
Changes in v2:
  - New patch

 .gitignore                         |  1 -
 Documentation/dontdiff             |  1 -
 Documentation/kbuild/makefiles.txt |  3 +--
 Makefile                           |  1 -
 scripts/Makefile.build             | 10 +++++-----
 scripts/Makefile.lib               |  2 +-
 6 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/.gitignore b/.gitignore
index 4bb60f0fa23b..7587ef56b92d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,7 +22,6 @@
 *.elf
 *.gcno
 *.gz
-*.hdrtest.c
 *.i
 *.ko
 *.lex.c
diff --git a/Documentation/dontdiff b/Documentation/dontdiff
index 554dfe4883d2..5eba889ea84d 100644
--- a/Documentation/dontdiff
+++ b/Documentation/dontdiff
@@ -19,7 +19,6 @@
 *.grep
 *.grp
 *.gz
-*.hdrtest.c
 *.html
 *.i
 *.jpeg
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index ca4b24ec0399..5080fec34609 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -1023,8 +1023,7 @@ When kbuild executes, the following steps are followed (roughly):
 	header-test-y specifies headers (*.h) in the current directory that
 	should be compile tested to ensure they are self-contained,
 	i.e. compilable as standalone units. If CONFIG_HEADER_TEST is enabled,
-	this autogenerates dummy sources to include the headers, and builds them
-	as part of extra-y.
+	this builds them as part of extra-y.
 
 --- 6.7 Commands useful for building a boot image
 
diff --git a/Makefile b/Makefile
index f23516980796..7f293b0431fe 100644
--- a/Makefile
+++ b/Makefile
@@ -1648,7 +1648,6 @@ clean: $(clean-dirs)
 		-o -name '*.dwo' -o -name '*.lst' \
 		-o -name '*.su'  \
 		-o -name '.*.d' -o -name '.*.tmp' -o -name '*.mod.c' \
-		-o -name '*.hdrtest.c' \
 		-o -name '*.lex.c' -o -name '*.tab.[ch]' \
 		-o -name '*.asn1.[ch]' \
 		-o -name '*.symtypes' -o -name 'modules.order' \
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index ee0319560513..776842b7e6a3 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -294,14 +294,14 @@ quiet_cmd_cc_lst_c = MKLST   $@
 $(obj)/%.lst: $(src)/%.c FORCE
 	$(call if_changed_dep,cc_lst_c)
 
-# Dummy C sources for header test (header-test-y target)
+# header test (header-test-y target)
 # ---------------------------------------------------------------------------
 
-quiet_cmd_header_test = HDRTEST $@
-      cmd_header_test = echo "\#include \"$*.h\"" > $@
+quiet_cmd_cc_s_h = CC      $@
+      cmd_cc_s_h = $(CC) $(c_flags) -S -o $@ -x c /dev/null -include $<
 
-$(obj)/%.hdrtest.c:
-	$(call cmd,header_test)
+$(obj)/%.h.s: $(src)/%.h FORCE
+	$(call if_changed_dep,cc_s_h)
 
 # Compile assembler sources (.S)
 # ---------------------------------------------------------------------------
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 3e630fcaffd1..55ae1ec65342 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -67,7 +67,7 @@ extra-$(CONFIG_OF_ALL_DTBS) += $(patsubst %.dtb,%.dt.yaml, $(dtb-))
 endif
 
 # Test self-contained headers
-extra-$(CONFIG_HEADER_TEST) += $(patsubst %.h,%.hdrtest.o,$(header-test-y))
+extra-$(CONFIG_HEADER_TEST) += $(addsuffix .s, $(header-test-y))
 
 # Add subdir path
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v3 3/4] kbuild: support header-test-pattern-y
From: Masahiro Yamada @ 2019-06-27 16:39 UTC (permalink / raw)
  To: linux-kbuild
  Cc: Jani Nikula, Sam Ravnborg, Masahiro Yamada, linux-doc,
	linux-kernel, Jonathan Corbet, Michal Marek
In-Reply-To: <20190627163903.28398-1-yamada.masahiro@socionext.com>

In my view, most of headers can be self-contained. So, it would be
tedious to add every header to header-test-y explicitly. We usually
end up with "all headers with some exceptions".

There are two types in exceptions:

[1] headers that are never compiled as standalone units

  For examples, include/linux/compiler-gcc.h is not intended for
  direct inclusion. We should always exclude such ones.

[2] headers that are conditionally compiled as standalone units

  Some headers can be compiled only for particular architectures.
  For example, include/linux/arm-cci.h can be compiled only for
  arm/arm64 because it requires <asm/arm-cci.h> to exist.
  Clang can compile include/soc/nps/mtm.h only for arc because
  it contains an arch-specific register in inline assembler.

So, you can write Makefile like this:

  header-test-                += linux/compiler-gcc.h
  header-test-$(CONFIG_ARM)   += linux/arm-cci.h
  header-test-$(CONFIG_ARM64) += linux/arm-cci.h
  header-test-$(CONFIG_ARC)   += soc/nps/mtm.h

The new syntax header-test-pattern-y will be useful to specify
"the rest".

The typical usage is like this:

  header-test-pattern-y += */*.h

This will add all the headers in sub-directories to the test coverage,
excluding $(header-test-). In this regards, header-test-pattern-y
behaves like a weaker variant of header-test-y.

Caveat:
The patterns in header-test-pattern-y are prefixed with $(srctree)/$(src)/
but not $(objtree)/$(obj)/. Stale generated headers are often left over
when you traverse the git history without cleaning. Wildcard patterns for
$(objtree) may match to stale headers, which could fail to compile.
One pitfall is $(srctree)/$(src)/ and $(objtree)/$(obj)/ point to the
same directory for in-tree building. So, header-test-pattern-y should
be used with care since it can potentially match to stale headers.

Caveat2:
You could use wildcard for header-test-. For example,

  header-test- += asm-generic/%

... will exclude headers in asm-generic directory. Unfortunately, the
wildcard character is '%' instead of '*' here because this is evaluated
by $(filter-out ...) whereas header-test-pattern-y is evaluated by
$(wildcard ...). This is a kludge, but seems useful in some places...

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Jani Nikula <jani.nikula@intel.com>
---

Changes in v3: None
Changes in v2:
  - New patch

 Documentation/kbuild/makefiles.txt | 10 ++++++++++
 scripts/Makefile.lib               | 11 +++++++++++
 2 files changed, 21 insertions(+)

diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index 5080fec34609..b817e6cefb77 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -1025,6 +1025,16 @@ When kbuild executes, the following steps are followed (roughly):
 	i.e. compilable as standalone units. If CONFIG_HEADER_TEST is enabled,
 	this builds them as part of extra-y.
 
+    header-test-pattern-y
+
+	This works as a weaker version of header-test-y, and accepts wildcard
+	patterns. The typical usage is:
+
+		  header-test-pattern-y += *.h
+
+	This specifies all the files that matches to '*.h' in the current
+	directory, but the files in 'header-test-' are excluded.
+
 --- 6.7 Commands useful for building a boot image
 
 	Kbuild provides a few macros that are useful when building a
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 55ae1ec65342..281864fcf0fe 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -67,6 +67,17 @@ extra-$(CONFIG_OF_ALL_DTBS) += $(patsubst %.dtb,%.dt.yaml, $(dtb-))
 endif
 
 # Test self-contained headers
+
+# Wildcard searches in $(srctree)/$(src)/, but not in $(objtree)/$(obj)/.
+# Stale generated headers are often left over, so pattern matching should
+# be avoided. Please notice $(srctree)/$(src)/ and $(objtree)/$(obj) point
+# to the same location for in-tree building. So, header-test-pattern-y should
+# be used with care.
+header-test-y	+= $(filter-out $(header-test-), \
+		$(patsubst $(srctree)/$(src)/%, %, \
+		$(wildcard $(addprefix $(srctree)/$(src)/, \
+		$(header-test-pattern-y)))))
+
 extra-$(CONFIG_HEADER_TEST) += $(addsuffix .s, $(header-test-y))
 
 # Add subdir path
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v7 1/2] fTPM: firmware TPM running in TEE
From: Jarkko Sakkinen @ 2019-06-27 16:32 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Sasha Levin, peterhuewe, jgg, corbet, linux-kernel, linux-doc,
	linux-integrity, linux-kernel, thiruan, bryankel, tee-dev,
	sumit.garg, rdunlap
In-Reply-To: <20190627133004.GA3757@apalos>

On Thu, 2019-06-27 at 16:30 +0300, Ilias Apalodimas wrote:
> is really useful. I don't have hardware to test this at the moment, but once i
> get it, i'll give it a spin.

Thank you for responding, really appreciate it.

Please note, however, that I already did my v5.3 PR so there is a lot of
time to give it a spin. In all cases, we will find a way to put this to
my v5.4 PR. I don't see any reason why not.

As soon as the cosmetic stuff is fixed that I remarked in v7 I'm ready
to take this to my tree and after that soonish make it available on
linux-next.

/Jarkko


^ permalink raw reply

* Re: [Linux-kernel-mentees][PATCH v5 1/5] Documentation: RCU: Convert RCU basic concepts to reST
From: Paul E. McKenney @ 2019-06-27 16:26 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Jiunn Chang, skhan, linux-kernel-mentees, rcu, linux-doc, josh,
	rostedt, mathieu.desnoyers, jiangshanlai, joel
In-Reply-To: <20190627083443.4f4918a7@lwn.net>

On Thu, Jun 27, 2019 at 08:34:43AM -0600, Jonathan Corbet wrote:
> On Wed, 26 Jun 2019 15:07:01 -0500
> Jiunn Chang <c0d1n61at3@gmail.com> wrote:
> 
> > RCU basic concepts reST markup.
> > 
> > Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> So this is a little detail but ... your signoff should be the last thing
> in the set of tags on the patch.
> 
> This isn't worth making you do yet another revision, so I went ahead and
> applied the patches and fixed the tag ordering on the way in.  I'll also
> append a patch adding the new RCU stuff into the core-api manual so people
> can actually get to it.

Please feel free to add my ack:

Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>

							Thanx, Paul


^ permalink raw reply

* Re: [PATCH 1/2] Documentation: perf: Update documentation for ThunderX2 PMU uncore driver
From: Will Deacon @ 2019-06-27 16:22 UTC (permalink / raw)
  To: Ganapatrao Kulkarni
  Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Will.Deacon@arm.com,
	mark.rutland@arm.com, corbet@lwn.net, jnair@caviumnetworks.com,
	Robert.Richter@cavium.com, Jan.Glauber@cavium.com,
	gklkml16@gmail.com
In-Reply-To: <20190627100118.nfveq4oktomqybtx@willie-the-truck>

On Thu, Jun 27, 2019 at 11:01:18AM +0100, Will Deacon wrote:
> On Fri, Jun 14, 2019 at 05:42:45PM +0000, Ganapatrao Kulkarni wrote:
> > From: Ganapatrao Kulkarni <ganapatrao.kulkarni@marvell.com>
> > 
> > Add documentation for Cavium Coherent Processor Interconnect (CCPI2) PMU.
> > 
> > Signed-off-by: Ganapatrao Kulkarni <gkulkarni@marvell.com>
> > ---
> >  Documentation/perf/thunderx2-pmu.txt | 20 +++++++++++---------
> >  1 file changed, 11 insertions(+), 9 deletions(-)
> > 
> > diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
> > index dffc57143736..62243230abc3 100644
> > --- a/Documentation/perf/thunderx2-pmu.txt
> > +++ b/Documentation/perf/thunderx2-pmu.txt
> > @@ -2,24 +2,26 @@ Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
> >  =============================================================
> >  
> >  The ThunderX2 SoC PMU consists of independent, system-wide, per-socket
> > -PMUs such as the Level 3 Cache (L3C) and DDR4 Memory Controller (DMC).
> > +PMUs such as the Level 3 Cache (L3C), DDR4 Memory Controller (DMC) and
> > +Cavium Coherent Processor Interconnect (CCPI2).
> >  
> >  The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles.
> >  Events are counted for the default channel (i.e. channel 0) and prorated
> >  to the total number of channels/tiles.
> >  
> > -The DMC and L3C support up to 4 counters. Counters are independently
> > -programmable and can be started and stopped individually. Each counter
> > -can be set to a different event. Counters are 32-bit and do not support
> > -an overflow interrupt; they are read every 2 seconds.
> > +The DMC, L3C support up to 4 counters and CCPI2 support up to 8 counters.
> 
> The DMC and L3C support up to 4 counters, while the CCPI2 supports up to 8
> counters.
> 
> > +Counters are independently programmable and can be started and stopped
> > +individually. Each counter can be set to a different event. DMC and L3C
> > +Counters are 32-bit and do not support an overflow interrupt; they are read
> 
> Counters -> counters
> 
> > +every 2 seconds. CCPI2 counters are 64-bit.
> 
> Assuming CCPI2 also doesn't support an overflow interrupt, I'd reword these
> two sentences as:
> 
>   None of the counters support an overflow interrupt and therefore sampling
>   events are unsupported. The DMC and L3C counters are 32-bit and read every
>   2 seconds. The CCPI2 counters are 64-bit and assumed not to overflow in
>   normal operation.

Mark reminded me that these are system PMUs and therefore sampling is
unsupported irrespective of the presence of an overflow interrupt, so you
can drop that part from the text.

Sorry for the confusion,

Will

^ permalink raw reply

* Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
From: Michal Hocko @ 2019-06-27 15:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Alexander Viro, Jonathan Corbet, Luis Chamberlain,
	Kees Cook, Johannes Weiner, Vladimir Davydov, linux-mm, linux-doc,
	linux-fsdevel, cgroups, linux-kernel, Roman Gushchin,
	Shakeel Butt, Andrea Arcangeli
In-Reply-To: <20190624174219.25513-3-longman@redhat.com>

On Mon 24-06-19 13:42:19, Waiman Long wrote:
> With the slub memory allocator, the numbers of active slab objects
> reported in /proc/slabinfo are not real because they include objects
> that are held by the per-cpu slab structures whether they are actually
> used or not.  The problem gets worse the more CPUs a system have. For
> instance, looking at the reported number of active task_struct objects,
> one will wonder where all the missing tasks gone.
> 
> I know it is hard and costly to get a real count of active objects.

What exactly is expensive? Why cannot slabinfo reduce the number of
active objects by per-cpu cached objects?

> So
> I am not advocating for that. Instead, this patch extends the
> /proc/sys/vm/drop_caches sysctl parameter by using a new bit (bit 3)
> to shrink all the kmem slabs which will flush out all the slabs in the
> per-cpu structures and give a more accurate view of how much memory are
> really used up by the active slab objects. This is a costly operation,
> of course, but it gives a way to have a clearer picture of the actual
> number of slab objects used, if the need arises.

drop_caches is a terrible interface. It destroys all the caching and
people are just too easy in using it to solve any kind of problem they
think they might have and cause others they might not see immediately.
I am strongly discouraging anybody - except for some tests which really
do want to see reproducible results without cache effects - from using
this interface and therefore I am not really happy to paper over
something that might be a real problem with yet another mode. If SLUB
indeed caches too aggressively on large machines then this should be
fixed.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [Linux-kernel-mentees][PATCH v5 1/5] Documentation: RCU: Convert RCU basic concepts to reST
From: Steven Rostedt @ 2019-06-27 15:13 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Jiunn Chang, skhan, linux-kernel-mentees, rcu, linux-doc, paulmck,
	josh, mathieu.desnoyers, jiangshanlai, joel
In-Reply-To: <20190627083443.4f4918a7@lwn.net>

On Thu, 27 Jun 2019 08:34:43 -0600
Jonathan Corbet <corbet@lwn.net> wrote:

> On Wed, 26 Jun 2019 15:07:01 -0500
> Jiunn Chang <c0d1n61at3@gmail.com> wrote:
> 
> > RCU basic concepts reST markup.
> > 
> > Signed-off-by: Jiunn Chang <c0d1n61at3@gmail.com>
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>  
> 
> So this is a little detail but ... your signoff should be the last thing
> in the set of tags on the patch.

Note, I've been seeing this a lot lately, and then noticed, that when I
downloaded a patch directly from patchwork, it placed all the
Reviewed-by and Acked-by tags after the original Signed-off-by. I
checked the original patch on the mailing list, and it had no other
tags but the Signed-off-by. I then pulled one of my own patches, and it
did it to that patch as well.

I too prefer the Signed-off-by be last, but our tooling needs to do
this as well, otherwise it's a failure in our procedures.

-- Steve

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox